Introduction

This is a continuation of my MIS 665 Mid-Term project and will be utilized for my Final Project. I will be using what I learned in the second half of the semester. In this project, I will demonstrate my skills in modeling, evaluation and deployment, using regression, classification and clustering.

Honor Code

Austin, N. "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."

Import Data

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score

import statsmodels.api as sm
from statsmodels.formula.api import ols

# model validation
from sklearn.model_selection import train_test_split


#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest

# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE

#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import SVG
#from graphviz import Source
from IPython.display import display
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier

#for validating your classification model
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, GridSearchCV 
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# feature selection (RFE and SelectKBest are already imported above)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2

#pip install scikit-plot (optional)
import scikitplot as skplt

import warnings
warnings.filterwarnings("ignore")

from sklearn.cluster import KMeans

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward
In [2]:
# load csv file
df_full = pd.read_csv('data/movie_metadata.csv')
df_full.head()
Out[2]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

Business Understanding

Answer the following questions:

Project Goals:
  • Predict movie success
  • Understand the background of this prediction problem

What kind of data would I collect?:
  • Facebook likes
  • Director
  • Movie duration
  • Genre
  • Number of voted users
  • Number of critics

What variables are highly correlated to imdb score?
  • We will use corr() to verify this.
  • We will use imdb score to measure success.

Data Understanding

Describe the data

  • The data contains 28 variables for 5043 movies and 4906 posters (998MB), spanning across 100 years in 66 countries.
  • There are 2399 unique director names, and thousands of actors/actresses.

-- Source: https://data.world/popculture/imdb-5000-movie-dataset

Variable Name              Description
movie_title                Title of the movie
duration                   Duration in minutes
director_name              Name of the director of the movie
director_facebook_likes    Number of likes on the director's Facebook page
actor_1_name               Primary actor starring in the movie
actor_1_facebook_likes     Number of likes on actor 1's Facebook page
actor_2_name               Other actor starring in the movie
actor_2_facebook_likes     Number of likes on actor 2's Facebook page
actor_3_name               Other actor starring in the movie
actor_3_facebook_likes     Number of likes on actor 3's Facebook page
num_user_for_reviews       Number of users who gave a review
num_critic_for_reviews     Number of critic reviews on IMDB
num_voted_users            Number of people who voted for the movie
cast_total_facebook_likes  Total number of Facebook likes for the entire cast of the movie
movie_facebook_likes       Number of Facebook likes on the movie's page
plot_keywords              Keywords describing the movie plot
facenumber_in_poster       Number of actors' faces featured on the movie poster
color                      Film colorization: ‘Black and White’ or ‘Color’
genres                     Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’
title_year                 Year in which the movie was released (1916 to 2016)
language                   English, Arabic, Chinese, French, German, Danish, Italian, Japanese etc.
country                    Country where the movie was produced
content_rating             Content rating of the movie
aspect_ratio               Aspect ratio the movie was made in
movie_imdb_link            IMDB link of the movie
gross                      Gross earnings of the movie in dollars
budget                     Budget of the movie in dollars
imdb_score                 IMDB score of the movie
In [3]:
# Let's look at the columns as seen in the data set.
# We are using this to verify that the data set we pulled in is the same as the source said it should be.
for col in df_full.columns: 
    print(col)
color
director_name
num_critic_for_reviews
duration
director_facebook_likes
actor_3_facebook_likes
actor_2_name
actor_1_facebook_likes
gross
genres
actor_1_name
movie_title
num_voted_users
cast_total_facebook_likes
actor_3_name
facenumber_in_poster
plot_keywords
movie_imdb_link
num_user_for_reviews
language
country
content_rating
budget
title_year
actor_2_facebook_likes
imdb_score
aspect_ratio
movie_facebook_likes

Identify data quality issues.

In [4]:
# How many records are in the data set?
len(df_full)
Out[4]:
5043
In [5]:
# Look at the top 5 records to see what might need to be cleaned up.
df_full.head()
Out[5]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

  • Genre appears to be concatenated by | with multiple values. We will need to clean this up.
  • We could also clean up the plot_keywords, if we feel this affects the imdb score.
  • We have null values in some of the columns.
In [6]:
# Do we have any null values?
df_full.isnull().sum().sort_values(ascending=False)
Out[6]:
gross                        884
budget                       492
aspect_ratio                 329
content_rating               303
plot_keywords                153
title_year                   108
director_name                104
director_facebook_likes      104
num_critic_for_reviews        50
actor_3_name                  23
actor_3_facebook_likes        23
num_user_for_reviews          21
color                         19
duration                      15
facenumber_in_poster          13
actor_2_name                  13
actor_2_facebook_likes        13
language                      12
actor_1_name                   7
actor_1_facebook_likes         7
country                        5
movie_facebook_likes           0
genres                         0
movie_title                    0
num_voted_users                0
movie_imdb_link                0
imdb_score                     0
cast_total_facebook_likes      0
dtype: int64
  • It is interesting to see that budget and gross have so many null values.
In [7]:
# Let's see how this looks on a bar chart.
df_full.isnull().sum().sort_values(ascending=False).plot(kind='barh',figsize=(8, 8))
plt.xlabel('Total Nulls')
plt.ylabel('Columns')
plt.title(" A display of null values for each column");
  • Several columns have null values; we will need to clean these up.
  • We might have to focus on gross, budget, aspect ratio and content rating.
  • Can we replace any of the other nulls with values, such as the mean or 0?
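
To help answer that question, it can be useful to look at the share of missing values per column rather than the raw count. Below is a minimal sketch on a made-up frame; `toy` and its values are hypothetical stand-ins for df_full, not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_full.
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, np.nan],
    "budget": [1.0, 2.0, np.nan, 4.0],
    "imdb_score": [7.9, 7.1, 6.8, 8.5],
})

# isnull().mean() gives the fraction of missing values in each column.
null_share = toy.isnull().mean().sort_values(ascending=False)
print((null_share * 100).round(1))
```

Using the counts from Out[6], gross is missing in 884 of 5043 rows (about 17.5%) and budget in 492 of 5043 rows (about 9.8%).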
In [8]:
# Any other data issues?

# How many duplicated rows are in the dataset?
df_full.duplicated().sum()
Out[8]:
45
  • 45 rows are duplicates of other rows in the data set.
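
Before dropping duplicates it can help to look at them side by side. Here is a small sketch of how `duplicated()` and `drop_duplicates()` behave; the `toy` frame is hypothetical and only illustrates the calls.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "A", "C"],
    "imdb_score": [7.0, 6.5, 7.0, 8.0],
})

# duplicated() marks only the second and later copies of a row.
n_dupes = int(toy.duplicated().sum())

# keep=False marks every copy, which is handy for inspecting them together.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each row.
deduped = toy.drop_duplicates()
print(n_dupes, len(all_copies), len(deduped))
```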

Identify data types

In [9]:
# What are the data types?
df_full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews         5022 non-null float64
language                     5031 non-null object
country                      5038 non-null object
content_rating               4740 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
actor_2_facebook_likes       5030 non-null float64
imdb_score                   5043 non-null float64
aspect_ratio                 4714 non-null float64
movie_facebook_likes         5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB
In [10]:
# Let's describe the numbers in the dataset.
df_full.describe()
Out[10]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4993.000000 5028.000000 4939.000000 5020.000000 5036.000000 4.159000e+03 5.043000e+03 5043.000000 5030.000000 5022.000000 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
mean 140.194272 107.201074 686.509212 645.009761 6560.047061 4.846841e+07 8.366816e+04 9699.063851 1.371173 272.770808 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std 121.601675 25.197441 2813.328607 1665.041728 15020.759120 6.845299e+07 1.384853e+05 18163.799124 2.013576 377.982886 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 50.000000 93.000000 7.000000 133.000000 614.000000 5.340988e+06 8.593500e+03 1411.000000 0.000000 65.000000 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% 110.000000 103.000000 49.000000 371.500000 988.000000 2.551750e+07 3.435900e+04 3090.000000 1.000000 156.000000 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% 195.000000 118.000000 194.500000 636.000000 11000.000000 6.230944e+07 9.630900e+04 13756.500000 2.000000 326.000000 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000

Identify value counts of a select list of columns considered to be important to predict a movie's success.

In [11]:
# How do the numeric columns correlate with imdb_score and with each other?
# (corr() only applies to numeric columns, so select those explicitly.)
df_full.select_dtypes('number').corr()
Out[11]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
num_critic_for_reviews 1.000000 0.258486 0.180674 0.271646 0.190016 0.480601 0.624943 0.263203 -0.033897 0.609387 0.119994 0.275707 0.282306 0.305303 -0.049786 0.683176
duration 0.258486 1.000000 0.173296 0.123558 0.088449 0.250298 0.314765 0.123074 0.013469 0.328403 0.074276 -0.135038 0.131673 0.261662 -0.090071 0.196605
director_facebook_likes 0.180674 0.173296 1.000000 0.120199 0.090723 0.144945 0.297057 0.119549 -0.041268 0.221890 0.021090 -0.063820 0.119601 0.170802 0.001642 0.162048
actor_3_facebook_likes 0.271646 0.123558 0.120199 1.000000 0.249927 0.308026 0.287239 0.473920 0.099368 0.230189 0.047451 0.096137 0.559662 0.052633 -0.003366 0.278844
actor_1_facebook_likes 0.190016 0.088449 0.090723 0.249927 1.000000 0.154468 0.192804 0.951661 0.072257 0.145461 0.022639 0.086873 0.390487 0.076099 -0.020049 0.135348
gross 0.480601 0.250298 0.144945 0.308026 0.154468 1.000000 0.637271 0.247400 -0.027755 0.559958 0.102179 0.030886 0.262768 0.198021 0.069346 0.378082
num_voted_users 0.624943 0.314765 0.297057 0.287239 0.192804 0.637271 1.000000 0.265911 -0.026998 0.798406 0.079621 0.007397 0.270790 0.410965 -0.014761 0.537924
cast_total_facebook_likes 0.263203 0.123074 0.119549 0.473920 0.951661 0.247400 0.265911 1.000000 0.091475 0.206923 0.036557 0.109971 0.628404 0.085787 -0.017885 0.209786
facenumber_in_poster -0.033897 0.013469 -0.041268 0.099368 0.072257 -0.027755 -0.026998 0.091475 1.000000 -0.069018 -0.019559 0.061504 0.071228 -0.062958 0.013713 0.008918
num_user_for_reviews 0.609387 0.328403 0.221890 0.230189 0.145461 0.559958 0.798406 0.206923 -0.069018 1.000000 0.084292 -0.003147 0.219496 0.292475 -0.024719 0.400594
budget 0.119994 0.074276 0.021090 0.047451 0.022639 0.102179 0.079621 0.036557 -0.019559 0.084292 1.000000 0.045726 0.044236 0.030688 0.006598 0.062039
title_year 0.275707 -0.135038 -0.063820 0.096137 0.086873 0.030886 0.007397 0.109971 0.061504 -0.003147 0.045726 1.000000 0.101890 -0.209167 0.159973 0.218678
actor_2_facebook_likes 0.282306 0.131673 0.119601 0.559662 0.390487 0.262768 0.270790 0.628404 0.071228 0.219496 0.044236 0.101890 1.000000 0.083808 -0.007783 0.243487
imdb_score 0.305303 0.261662 0.170802 0.052633 0.076099 0.198021 0.410965 0.085787 -0.062958 0.292475 0.030688 -0.209167 0.083808 1.000000 0.059445 0.247049
aspect_ratio -0.049786 -0.090071 0.001642 -0.003366 -0.020049 0.069346 -0.014761 -0.017885 0.013713 -0.024719 0.006598 0.159973 -0.007783 0.059445 1.000000 0.025737
movie_facebook_likes 0.683176 0.196605 0.162048 0.278844 0.135348 0.378082 0.537924 0.209786 0.008918 0.400594 0.062039 0.218678 0.243487 0.247049 0.025737 1.000000
  • Every numeric column has a positive correlation to imdb score except for facenumber in poster (-0.06) and title year (-0.21).
  • Number of voted users has the highest positive correlation to imdb score (0.41).
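
Reading one row out of a 16 x 16 table is error-prone, so here is a minimal sketch of pulling out just the imdb_score correlations and sorting them. The `toy` frame and its numbers are made up for illustration; on the project data the same two lines would be applied to df_full.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_full.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "num_voted_users": [10, 20, 35, 50, 80],
    "title_year": [2015, 2012, 2010, 2005, 2000],
    "director_name": ["a", "b", "c", "d", "e"],  # non-numeric, left out below
})

# Keep the numeric columns, take the imdb_score column of the correlation
# matrix, drop its correlation with itself, and sort from high to low.
numeric = toy.select_dtypes(include="number")
score_corr = (
    numeric.corr()["imdb_score"]
    .drop("imdb_score")
    .sort_values(ascending=False)
)
print(score_corr)
```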
In [12]:
plt.figure(figsize=(12,12))
sns.heatmap(df_full.select_dtypes('number').corr(), vmax=.8, square=True, annot=True, fmt=".1f", cmap='Blues')
plt.title("Correlation on all columns");
In [13]:
# What is the break down of director facebook likes?
df_full['director_facebook_likes'].value_counts().sort_values(ascending=False).head().reset_index()
Out[13]:
index director_facebook_likes
0 0.0 907
1 3.0 70
2 6.0 66
3 7.0 64
4 2.0 63
In [14]:
# What is the break down of duration?
df_full['duration'].value_counts().sort_values(ascending=False).head().reset_index()
Out[14]:
index duration
0 90.0 161
1 100.0 141
2 101.0 139
3 98.0 135
4 97.0 131
In [15]:
# What is the break down of number of voted users?
df_full['num_voted_users'].value_counts().sort_values(ascending=False).head().reset_index()
Out[15]:
index num_voted_users
0 57 5
1 6 4
2 62 3
3 38 3
4 8 3
In [16]:
# What is the break down of number of critics for reviews?
df_full['num_critic_for_reviews'].value_counts().sort_values(ascending=False).head().reset_index()
Out[16]:
index num_critic_for_reviews
0 1.0 43
1 9.0 37
2 5.0 36
3 10.0 35
4 8.0 35
In [17]:
# What is the break down of movie facebook likes?
df_full['movie_facebook_likes'].value_counts().sort_values(ascending=False).head().reset_index()
Out[17]:
index movie_facebook_likes
0 0 2181
1 1000 109
2 11000 83
3 10000 81
4 12000 62

Data Preparation

In [18]:
# First let's check our total row count again.
len(df_full)
Out[18]:
5043

Cleaning

In [19]:
# Let's remove the duplicate rows, so they don't skew the data.
# Move from the df_full dataset to df
df = df_full.drop_duplicates()
len(df)
Out[19]:
4998
  • We now have a total of 4998 rows after dropping the 45 duplicate rows.
In [20]:
# Earlier we found that we had null values.
# Let's drop the rows with nulls in the two columns that had the most,
# so that we correlate against rows with the most complete data.
# gross  884
# budget 492

df = df.dropna(subset=['gross', 'budget'])
In [21]:
# What is our count after dropping nulls in the gross and budget columns?
len(df)
Out[21]:
3857
  • We are now left with 3857 records in the data set.
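
The key point of this step is that only rows missing gross or budget are dropped; nulls in other columns stay for now. A small sketch of that behavior, using a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with nulls spread over three columns.
toy = pd.DataFrame({
    "gross": [100.0, np.nan, 300.0, 400.0],
    "budget": [50.0, 60.0, np.nan, 80.0],
    "aspect_ratio": [2.35, 1.85, 2.35, np.nan],
})

# Drop a row only when gross or budget is missing.
kept = toy.dropna(subset=["gross", "budget"])
print(kept)
```

The last row survives even though its aspect_ratio is null, which is why aspect_ratio still shows nulls in the next cells.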
In [22]:
# How do our nulls look now?
df.isnull().sum().sort_values(ascending=False).plot(kind='barh',figsize=(8, 8))
plt.xlabel('Total Nulls')
plt.ylabel('Columns')
plt.title(" A display of null values for each column");
  • Because the nulls in aspect_ratio and content_rating affect only about 2% of the data set (74 and 51 of 3857 rows), we will leave those rows in.
  • But we need to see if we can replace some of the remaining null values, now that we have a good set of data.
In [23]:
# What are our true counts of null data now?
df.isnull().sum().sort_values(ascending=False)
Out[23]:
aspect_ratio                 74
content_rating               51
plot_keywords                31
actor_3_facebook_likes       10
actor_3_name                 10
facenumber_in_poster          6
actor_2_name                  5
actor_2_facebook_likes        5
actor_1_facebook_likes        3
actor_1_name                  3
language                      3
color                         2
duration                      1
num_critic_for_reviews        1
genres                        0
director_facebook_likes       0
director_name                 0
gross                         0
movie_facebook_likes          0
movie_title                   0
num_voted_users               0
movie_imdb_link               0
num_user_for_reviews          0
country                       0
budget                        0
title_year                    0
imdb_score                    0
cast_total_facebook_likes     0
dtype: int64
In [24]:
# Let's describe the data that is left, so we can set some values based on mean()
df.describe()
Out[24]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 3856.000000 3856.000000 3857.000000 3847.000000 3854.000000 3.857000e+03 3.857000e+03 3857.000000 3851.000000 3857.000000 3.857000e+03 3857.000000 3852.000000 3857.000000 3783.000000 3857.000000
mean 162.894450 109.901193 783.721027 747.290096 7573.593669 5.091264e+07 1.023181e+05 11227.824734 1.376785 326.388644 4.520189e+07 2003.068188 1959.018432 6.463806 2.109413 9081.565725
std 123.953669 22.740169 3025.924047 1841.623945 15403.866813 6.930377e+07 1.502522e+05 18916.591384 2.055483 407.839385 2.233096e+08 10.005510 4472.171290 1.053697 0.353209 21267.885011
min 1.000000 34.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1920.000000 0.000000 1.600000 1.180000 0.000000
25% 72.000000 95.000000 10.000000 183.000000 721.000000 6.754898e+06 1.726100e+04 1815.000000 0.000000 102.000000 1.000000e+07 1999.000000 362.000000 5.900000 1.850000 0.000000
50% 134.000000 106.000000 58.000000 427.000000 1000.000000 2.782987e+07 5.038900e+04 3871.000000 1.000000 202.000000 2.400000e+07 2005.000000 664.000000 6.600000 2.350000 206.000000
75% 221.000000 120.000000 222.000000 685.000000 12000.000000 6.545231e+07 1.239400e+05 15944.000000 2.000000 390.000000 5.000000e+07 2010.000000 971.000000 7.200000 2.350000 11000.000000
max 813.000000 330.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.300000 16.000000 349000.000000
  • We can use this describe to decide which columns we should analyze.

Transforming

In [25]:
# Replace the remaining nulls in the numeric columns with their (rounded) means from df.describe().
df = df.fillna({
    'num_critic_for_reviews': 163.0,
    'duration': 110.0,
    'actor_1_facebook_likes': 7574.0,
    'actor_2_facebook_likes': 1959.0,
    'actor_3_facebook_likes': 747.0,
    'facenumber_in_poster': 1.4,
    'aspect_ratio': 2.1,
})
In [26]:
df.isnull().sum().sort_values(ascending=False)
Out[26]:
content_rating               51
plot_keywords                31
actor_3_name                 10
actor_2_name                  5
language                      3
actor_1_name                  3
color                         2
movie_title                   0
director_name                 0
num_critic_for_reviews        0
duration                      0
director_facebook_likes       0
actor_3_facebook_likes        0
actor_1_facebook_likes        0
gross                         0
genres                        0
movie_facebook_likes          0
num_voted_users               0
aspect_ratio                  0
facenumber_in_poster          0
movie_imdb_link               0
num_user_for_reviews          0
country                       0
budget                        0
title_year                    0
actor_2_facebook_likes        0
imdb_score                    0
cast_total_facebook_likes     0
dtype: int64
In [27]:
df.isnull().sum().sort_values(ascending=False).plot(kind='barh',figsize=(8,8))
plt.xlabel('Total Nulls')
plt.ylabel('Columns')
plt.title(" A display of null values for each column");
  • Now it appears we have some fairly clean data. Let's see what else we need to do.
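
The mean replacement above types the rounded means in by hand. As an alternative, here is a minimal sketch that computes the means from the data instead; the `toy` frame is hypothetical, and this is one possible approach rather than what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two numeric columns and one text column with nulls.
toy = pd.DataFrame({
    "duration": [100.0, np.nan, 120.0],
    "aspect_ratio": [2.35, 1.85, np.nan],
    "content_rating": ["PG-13", None, "R"],
})

# Compute the mean of every numeric column, then fill column by column.
# Columns that are not in the Series of means (the text column) are untouched.
num_cols = toy.select_dtypes(include="number").columns
filled = toy.fillna(toy[num_cols].mean())
print(filled)
```

This avoids copying numbers from a describe() table, at the cost of using the unrounded means.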
In [28]:
# For classification later, let's create a dataset with no nulls left in any column
dfclass = df.dropna()

dfclass.head()
Out[28]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Action|Adventure|Sci-Fi ... 738.0 English USA PG-13 263700000.0 2012.0 632.0 6.6 2.35 24000

5 rows × 28 columns

Add columns

In [29]:
# We found that genre had multiple results concatenated by |, let's fix this.
# We will do this by using dummy columns to split them but keep them in the dataset.

# -- Pulled from the canvas forum. :)
df = df.join(df.pop('genres').str.get_dummies('|'))
df.head()
Out[29]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name ... Music Musical Mystery Romance Sci-Fi Short Sport Thriller War Western
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder ... 0 0 0 0 1 0 0 0 0 0
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp ... 0 0 0 0 0 0 0 0 0 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz ... 0 0 0 0 0 0 0 1 0 0
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy ... 0 0 0 0 0 0 0 1 0 0
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Daryl Sabara ... 0 0 0 0 1 0 0 0 0 0

5 rows × 50 columns

  • We now have columns for the genres!
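
To see what the one-liner above does, here is a small sketch on two rows. The titles and genre strings are taken from the head() output; the `toy` and `wide` names are just for illustration.

```python
import pandas as pd

# Two rows with the genre strings shown in the head() output.
toy = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "genres": ["Action|Adventure|Fantasy|Sci-Fi", "Action|Adventure|Thriller"],
})

# str.get_dummies splits each string on '|' and builds one 0/1 column per genre.
dummies = toy["genres"].str.get_dummies("|")

# Replace the original genres column with the new indicator columns.
wide = toy.drop(columns="genres").join(dummies)
print(wide)
```

A movie with several genres gets a 1 in each of its genre columns, so the rows are not duplicated.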
In [30]:
# Let's create a return_on_investment column, to see if this has a high correlation in later analysis.
df['return_on_investment'] = (df['gross']/df['budget'])*100
df.return_on_investment.head().reset_index()
Out[30]:
index return_on_investment
0 0 320.888543
1 1 103.134717
2 2 81.662929
3 3 179.252257
4 5 27.705225
  • We can now see the return on investment in one column.
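
Note that the column as defined here is gross as a percentage of budget, so 100 means the movie earned back exactly its budget. The sketch below reproduces the first two values from Out[30] and, as an assumption that is not used elsewhere in this project, also shows the more common net version (the `net_roi_pct` name is hypothetical).

```python
import pandas as pd

# gross and budget for the first two rows shown in the head() output.
toy = pd.DataFrame({
    "gross": [760505847.0, 309404152.0],
    "budget": [237000000.0, 300000000.0],
})

# As defined in this notebook: gross as a percentage of budget.
toy["return_on_investment"] = toy["gross"] / toy["budget"] * 100

# A common alternative (assumption, not used in this project): net return,
# which subtracts the budget before dividing.
toy["net_roi_pct"] = (toy["gross"] - toy["budget"]) / toy["budget"] * 100
print(toy)
```

The two versions differ by exactly 100 points, so they rank movies the same way and correlate identically with other columns.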

Remove Columns

  • Let's remove some columns that we don't need for this research.
In [31]:
# Is color important? 
df['color'].value_counts().sort_values(ascending=False).head().reset_index()
Out[31]:
index color
0 Color 3725
1 Black and White 130
  • Over 96% of the movies are in color, so this column won't help us determine correlation. Let's drop color.
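
The percentage can be read straight from pandas instead of being worked out by hand. A minimal sketch that rebuilds a column from the two counts in Out[31]; the `color` Series is reconstructed for illustration only.

```python
import pandas as pd

# Rebuild a column with the counts shown in Out[31].
color = pd.Series(["Color"] * 3725 + ["Black and White"] * 130, name="color")

# normalize=True returns shares instead of counts.
share = color.value_counts(normalize=True)
print((share * 100).round(1))
```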
In [32]:
df = df.drop('color', axis=1)
In [33]:
# Is language important?
df['language'].value_counts().sort_values(ascending=False).head().reset_index()
Out[33]:
index language
0 English 3674
1 French 37
2 Spanish 26
3 Mandarin 14
4 German 13
  • About 95% of the movies are in English (3674 of 3857). This seems like a non-factor. Let's remove language.
In [34]:
df = df.drop('language', axis=1)
In [35]:
# Is facenumber in poster important?
df['facenumber_in_poster'].value_counts().sort_values(ascending=False).head().reset_index()
Out[35]:
index facenumber_in_poster
0 0.0 1631
1 1.0 980
2 2.0 541
3 3.0 296
4 4.0 165

Business Intelligence

  • Pivot tables
  • Data visualization with business questions
In [36]:
# Let's count and display movie titles by year.
df['title_year'].hist(bins=150)
plt.xlabel('Movie Title Year')
plt.ylabel('Count of Movies each year')
plt.title("All movies listed by year");
  • Should we remove all movies before 1980 since a majority are after 1980?
In [37]:
df = df.loc[df['title_year'] >= 1980]
df.groupby('title_year').size().plot()
plt.xlabel('Movie Title Year')
plt.ylabel('Count of Movies each year')
plt.title("Only movies listed by year, 1980 onward");
  • Sweet, now we only have data on or after 1980.
In [38]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement.
sns.histplot(df.title_year, kde=True)
plt.xlabel('Movie Title Year')
plt.ylabel('Count of Movies each year')
plt.title("Distribution of movies by year, 1980 onward");
  • This is an interesting way to view the same data.
In [39]:
# Top 20 movie titles by gross
df.groupby('movie_title')['gross'].sum().sort_values(ascending=False).head(20).plot(kind='barh', figsize=(10,10));
plt.xlabel('Gross (USD)')
plt.ylabel('Movie Title')
plt.title("Top 20 movies based on gross");
In [40]:
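# IMDB_SCORE has quite a few distinct values, so let's break these down into bins for easier plotting.
# Note: df_score is a second name for df, not a copy, so the new column is added to df as well
# (the plots below rely on that).
df_score = df
# setting my own values for bins: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10]

df_score['imdbscores_bins'] = pd.cut(df['imdb_score'], bins=[0, 2, 4, 6, 8, 10], labels=[1,2,3,4,5])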
df_score.head()
Out[40]:
director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name movie_title ... Mystery Romance Sci-Fi Short Sport Thriller War Western return_on_investment imdbscores_bins
0 James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder Avatar ... 0 0 1 0 0 0 0 0 320.888543 4
1 Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp Pirates of the Caribbean: At World's End ... 0 0 0 0 0 0 0 0 103.134717 4
2 Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz Spectre ... 0 0 0 0 0 1 0 0 81.662929 4
3 Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy The Dark Knight Rises ... 0 0 0 0 0 1 0 0 179.252257 5
5 Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Daryl Sabara John Carter ... 0 0 1 0 0 0 0 0 27.705225 4

5 rows × 50 columns

  • Thank you Dr. Chae for this helpful code. Plotting against every distinct imdb score made the plots hard to read, so we can now see our plots on imdb scores in 5 easy-to-read bins.
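
One detail worth knowing about pd.cut is that each bin includes its right edge by default, so a score of exactly 8.0 lands in bin 4, not bin 5. A small sketch with a few hand-picked scores (the `scores` values are chosen for illustration):

```python
import pandas as pd

# A few scores, including two that sit exactly on a bin edge.
scores = pd.Series([1.6, 6.0, 7.9, 8.0, 8.5])

# Same bins and labels as above: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10].
binned = pd.cut(scores, bins=[0, 2, 4, 6, 8, 10], labels=[1, 2, 3, 4, 5])
print(binned.astype(int).tolist())
```

This matches the head() output above, where 7.9 falls in bin 4 and 8.5 falls in bin 5.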
In [41]:
sns.catplot(x="imdbscores_bins", y="gross", data=df, kind="violin",
               height=7, aspect=2, palette="muted")
plt.xlabel('IMDB score bin')
plt.ylabel('Gross')
plt.title("Gross by IMDB score bin");
  • The violin plots above show that, in every imdb score bin, high grossing films are far fewer than lower grossing films. Does this mean that there were not that many high grossing films overall?
In [42]:
sns.lmplot(x="imdbscores_bins", y="movie_facebook_likes", data=df, order=2)
plt.xlabel('IMDB score bin')
plt.ylabel('Movie Facebook Likes')
plt.title("Facebook likes by IMDB score bin");
  • It appears that movies around imdb score bin 4 (scores above 6 and up to 8) received the most movie facebook likes.
In [43]:
sns.lmplot(x="imdbscores_bins", y="duration", data=df, order=2)
plt.xlabel('IMDB score bin')
plt.ylabel('Duration')
plt.title("Duration by IMDB score bin");
  • It is interesting to see that duration follows a pattern oddly similar to the one for movie facebook likes above.
In [44]:
# violin plot

sns.catplot(x="imdbscores_bins", y="duration", data=df, kind="violin",
               height=8, aspect=2, palette="muted")
plt.xlabel('IMDB score bin')
plt.ylabel('Duration')
plt.title("Duration by IMDB score bin");
  • This shows me that the duration is pretty average regardless of what the imdb score is.
In [45]:
plt.scatter(df['imdb_score'], df['movie_facebook_likes'])
plt.xlabel('IMDB Score')
plt.ylabel('Movie Facebook Likes')
plt.title("IMDB Score in relation to movie facebook likes");
  • The higher the imdb score, clearly the more movie facebook likes. If you can get a high score, your fans will be happy. Try to stay in the 7-8 range.

Correlation Analysis

  • corr()
  • heatmap()
  • interpretation

Correlation

In [46]:
# Earlier we performed correlation to see which fields we wanted to use for value_counts.
# Let's now research the correlation between several fields.
# We start with the correlation between imdb_score and genres.
df_corr = df_full[['imdb_score','genres']]
df_corr.head()

# Now let's split the genres by category

# -- Pulled from the canvas forum. :)
df_genres = df_corr.join(df_corr.pop('genres').str.get_dummies('|'))
df_genres.head()
Out[46]:
imdb_score Action Adventure Animation Biography Comedy Crime Documentary Drama Family ... Mystery News Reality-TV Romance Sci-Fi Short Sport Thriller War Western
0 7.9 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 7.1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 6.8 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 8.5 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 7.1 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 27 columns

In [47]:
# Now let's do correlation on these fewer columns
df_genres.corr()
Out[47]:
imdb_score Action Adventure Animation Biography Comedy Crime Documentary Drama Family ... Mystery News Reality-TV Romance Sci-Fi Short Sport Thriller War Western
imdb_score 1.000000 -0.097872 -0.000407 0.026721 0.156310 -0.168619 0.050437 0.102859 0.294229 -0.061042 ... 0.013053 0.023664 -0.029960 0.003983 -0.053158 -0.001740 0.028191 -0.070857 0.117279 0.030816
Action -0.097872 1.000000 0.316229 -0.020610 -0.094847 -0.169083 0.154574 -0.076105 -0.234371 -0.066612 ... -0.047899 -0.013283 -0.010844 -0.170086 0.281416 -0.017151 -0.039522 0.280216 0.035922 0.026893
Adventure -0.000407 0.316229 1.000000 0.299198 -0.069328 -0.033567 -0.153022 -0.060808 -0.238828 0.313683 ... -0.066083 -0.011548 -0.009428 -0.112257 0.233736 0.001383 -0.055843 -0.039125 0.005139 0.045723
Animation 0.026721 -0.020610 0.299198 1.000000 -0.043863 0.157785 -0.089255 -0.029139 -0.171664 0.533854 ... -0.049649 -0.005478 -0.004472 -0.074239 0.060742 -0.007073 -0.013598 -0.123403 -0.028697 -0.011177
Biography 0.156310 -0.094847 -0.069328 -0.043863 1.000000 -0.139947 -0.008122 0.044146 0.210794 -0.064717 ... -0.079559 -0.006059 -0.004947 -0.017032 -0.092645 -0.007824 0.151912 -0.094377 0.078493 0.002248
Comedy -0.168619 -0.169083 -0.033567 0.157785 -0.139947 1.000000 -0.081863 -0.082920 -0.246294 0.205168 ... -0.190352 -0.018746 0.005310 0.176577 -0.088571 0.001877 0.005370 -0.357295 -0.120537 -0.056799
Crime 0.050437 0.154574 -0.153022 -0.089255 -0.008122 -0.081863 1.000000 -0.045330 0.071549 -0.134400 ... 0.119896 0.010055 -0.009215 -0.123389 -0.129659 -0.014574 -0.072774 0.346914 -0.081624 -0.011743
Documentary 0.102859 -0.076105 -0.060808 -0.029139 0.044146 -0.082920 -0.045330 1.000000 -0.127662 -0.046293 ... -0.047680 0.155605 -0.003123 -0.083151 -0.054530 0.036233 0.039136 -0.094840 0.025058 -0.021957
Drama 0.294229 -0.234371 -0.238828 -0.171664 0.210794 -0.246294 0.071549 -0.127662 1.000000 -0.181126 ... 0.003733 -0.008838 -0.000573 0.160637 -0.203377 -0.032423 0.064632 -0.025441 0.158674 0.017637
Family -0.061042 -0.066612 0.313683 0.533854 -0.064717 0.205168 -0.134400 -0.046293 -0.181126 1.000000 ... -0.068619 -0.008501 -0.006941 -0.050654 0.022035 0.009300 0.035226 -0.201545 -0.066827 -0.025565
Fantasy -0.044543 0.054351 0.272621 0.243906 -0.084332 0.030918 -0.147669 -0.054188 -0.191964 0.318924 ... -0.007081 -0.009050 -0.007389 -0.045401 0.030621 -0.011686 -0.058736 -0.087613 -0.053711 -0.038666
Film-Noir 0.036544 -0.018790 -0.016336 -0.007749 -0.008572 -0.026518 0.029320 -0.005411 0.010517 -0.012026 ... 0.046292 -0.000842 -0.000687 -0.004406 -0.012874 -0.001087 -0.006678 0.029745 -0.007248 -0.004833
Game-Show -0.044341 -0.007667 -0.006666 -0.003162 -0.003498 -0.010821 -0.006515 -0.002208 -0.014494 -0.004907 ... -0.004672 -0.000344 0.707037 0.026555 -0.005253 -0.000444 -0.002725 -0.008778 -0.002957 -0.001972
History 0.117962 -0.007919 0.015802 -0.037098 0.298959 -0.138276 -0.064236 0.032874 0.165030 -0.062440 ... -0.065292 0.035943 -0.004121 -0.013133 -0.077175 -0.006518 0.008196 -0.066611 0.334222 0.029242
Horror -0.189001 -0.061645 -0.111222 -0.070925 -0.085533 -0.154504 -0.114829 -0.055693 -0.230977 -0.111631 ... 0.182976 -0.008666 -0.007075 -0.162556 0.099807 0.008786 -0.068731 0.191753 -0.071467 -0.049744
Music -0.005961 -0.095870 -0.066568 -0.005842 0.086484 0.037794 -0.056082 0.082701 0.054960 0.018459 ... -0.053380 -0.005136 -0.004193 0.054719 -0.060502 -0.006632 -0.040733 -0.118062 -0.029535 -0.022319
Musical 0.009536 -0.083342 0.015548 0.131708 0.017684 0.053988 -0.053028 -0.025705 0.000254 0.166719 ... -0.046078 -0.004000 -0.003266 0.084091 -0.045984 0.034299 -0.031723 -0.096653 -0.022078 -0.004874
Mystery 0.013053 -0.047899 -0.066083 -0.049649 -0.079559 -0.190352 0.119896 -0.047680 0.003733 -0.068619 ... 1.000000 -0.008094 -0.006608 -0.100597 0.032268 -0.010451 -0.064193 0.316460 -0.049874 -0.031967
News 0.023664 -0.013283 -0.011548 -0.005478 -0.006059 -0.018746 0.010055 0.155605 -0.008838 -0.008501 ... -0.008094 1.000000 -0.000486 -0.012939 -0.009101 -0.000769 -0.004721 -0.015207 -0.005123 -0.003417
Reality-TV -0.029960 -0.010844 -0.009428 -0.004472 -0.004947 0.005310 -0.009215 -0.003123 -0.000573 -0.006941 ... -0.006608 -0.000486 1.000000 0.037559 -0.007430 -0.000627 -0.003854 -0.012415 -0.004183 -0.002789
Romance 0.003983 -0.170086 -0.112257 -0.074239 -0.017032 0.176577 -0.123389 -0.083151 0.160637 -0.050654 ... -0.100597 -0.012939 0.037559 1.000000 -0.124676 0.013737 -0.015286 -0.213156 0.010109 0.002467
Sci-Fi -0.053158 0.281416 0.233736 0.060742 -0.092645 -0.088571 -0.129659 -0.054530 -0.203377 0.022035 ... 0.032268 -0.009101 -0.007430 -0.124676 1.000000 0.007490 -0.055945 0.110141 -0.075323 -0.043421
Short -0.001740 -0.017151 0.001383 -0.007073 -0.007824 0.001877 -0.014574 0.036233 -0.032423 0.009300 ... -0.010451 -0.000769 -0.000627 0.013737 0.007490 1.000000 -0.006096 -0.005600 -0.006616 -0.004412
Sport 0.028191 -0.039522 -0.055843 -0.013598 0.151912 0.005370 -0.072774 0.039136 0.064632 0.035226 ... -0.064193 -0.004721 -0.003854 -0.015286 -0.055945 -0.006096 1.000000 -0.111131 -0.030062 -0.027098
Thriller -0.070857 0.280216 -0.039125 -0.123403 -0.094377 -0.357295 0.346914 -0.094840 -0.025441 -0.201545 ... 0.316460 -0.015207 -0.012415 -0.213156 0.110141 -0.005600 -0.111131 1.000000 -0.051824 -0.051909
War 0.117279 0.035922 0.005139 -0.028697 0.078493 -0.120537 -0.081624 0.025058 0.158674 -0.066827 ... -0.049874 -0.005123 -0.004183 0.010109 -0.075323 -0.006616 -0.030062 -0.051824 1.000000 0.020838
Western 0.030816 0.026893 0.045723 -0.011177 0.002248 -0.056799 -0.011743 -0.021957 0.017637 -0.025565 ... -0.031967 -0.003417 -0.002789 0.002467 -0.043421 -0.004412 -0.027098 -0.051909 0.020838 1.000000

27 rows × 27 columns

  • The following genres have a positive correlation with imdb score (most of them weak; Drama is the strongest at about 0.29).
     - Animation
     - Biography
     - Crime
     - Documentary
     - Drama
     - Film-Noir
     - History
     - Musical
     - Mystery
     - News
     - Romance
     - Sport
     - War
     - Western
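A list like the one above can be pulled straight from the correlation matrix instead of being read off by eye. The sketch below is only an illustration: `toy` is a small hypothetical frame (not the movie data) shaped like `df_genres`, and `positive_correlations` is a helper name made up for this example.

```python
import pandas as pd

# Hypothetical toy data standing in for df_genres (not the real movie data)
toy = pd.DataFrame({
    "imdb_score": [7.9, 7.1, 6.8, 8.5, 5.2, 6.0],
    "Drama":      [1, 0, 0, 1, 0, 0],
    "Comedy":     [0, 0, 1, 0, 1, 1],
    "War":        [1, 0, 0, 1, 0, 1],
})

def positive_correlations(frame, target="imdb_score"):
    """Return the columns positively correlated with target, strongest first."""
    corr = frame.corr()[target].drop(target)
    return corr[corr > 0].sort_values(ascending=False)

positive = positive_correlations(toy)
print(positive)
```

Calling `positive_correlations(df_genres)` on the real data should reproduce the list above, assuming `df_genres` holds only `imdb_score` and the 0/1 genre columns.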
In [48]:
# Let's take a look at doing a heat map on a few of these fields. 
plt.figure(figsize=(12,12))
sns.heatmap(df_genres.corr(), vmax=.8, square=True, annot=True, fmt=".1f", cmap='Purples')
plt.title("Correlation on all genres");
  • We were trying to identify what correlates highly with imdb score, but the heat map above quickly shows that the strongest correlations are between genres themselves: Reality-TV and Game-Show (about 0.7), and Family and Animation (about 0.5).
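The strongest genre pairs can also be ranked programmatically rather than spotted on the heat map. The sketch below is a minimal example, not part of the original analysis: `top_pairs` is a hypothetical helper, and the small matrix reuses four genres with values copied from the correlation table above.

```python
import numpy as np
import pandas as pd

def top_pairs(corr, n=3):
    """Return the n strongest off-diagonal correlations as a Series."""
    # Keep only the upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Values copied from the correlation table above (four genres only)
labels = ["Animation", "Family", "Game-Show", "Reality-TV"]
corr = pd.DataFrame(
    [[1.000000, 0.533854, -0.003162, -0.004472],
     [0.533854, 1.000000, -0.004907, -0.006941],
     [-0.003162, -0.004907, 1.000000, 0.707037],
     [-0.004472, -0.006941, 0.707037, 1.000000]],
    index=labels, columns=labels,
)

strongest = top_pairs(corr, n=2)
print(strongest)
```

On the full data, `top_pairs(df_genres.corr(), n=10)` would give a ranked list to compare against the heat map.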
In [49]:
# -- Pulled from the canvas forum. :)

# movie_facebook_likes has many distinct values, so let's break them down into bins for easier plotting.
# Create a new df (a copy, so df_full itself is not modified)
df_likes = df_full.copy()

# Setting my own bin edges. Intervals are right-closed, so films with exactly 0 likes
# fall outside the first bin (0, 10000] and are left unbinned (NaN).
like_bins = [0, 10000, 25000, 50000, 75000, 100000, 125000, 150000, 175000,
             200000, 225000, 250000, 275000, 300000, 325000, 350000]
df_likes['movie_fb_likes_bins'] = pd.cut(df['movie_facebook_likes'], bins=like_bins, labels=list(range(1, 16)))
df_likes.head()

# Now let's group by the movie_facebook_likes bins and see how the other columns average out in each one.
df_likes.groupby('movie_fb_likes_bins').mean(numeric_only=True)
Out[49]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_fb_likes_bins
1 83.966024 103.438935 350.600000 494.130074 4799.877645 2.452572e+07 33375.251376 7183.014679 1.375229 163.663303 4.701950e+07 2001.644954 1107.704420 5.961835 2.058475 1608.278899
2 230.471264 113.358238 1226.363985 996.042226 9732.637931 6.241834e+07 169430.132184 14385.839080 1.310212 455.038314 4.717901e+07 2007.404215 2667.468330 6.867241 2.159981 16388.888889
3 320.586873 116.104247 1558.926641 1218.169884 10531.046332 8.319797e+07 231560.752896 16190.494208 1.662791 510.239382 5.837618e+07 2010.444015 3326.980695 7.075290 2.181853 35343.629344
4 398.812500 119.687500 1384.760417 2263.833333 13616.260417 1.296521e+08 254207.239583 22957.885417 1.458333 570.093750 8.283438e+07 2012.458333 5176.145833 6.971875 2.239896 60666.666667
5 494.078947 131.026316 1228.842105 2452.000000 14874.210526 1.422731e+08 331133.184211 25063.657895 1.657895 804.605263 1.010395e+08 2013.263158 5788.868421 7.442105 2.258947 85842.105263
6 501.333333 131.333333 1779.904762 3509.428571 16547.047619 1.771279e+08 459085.333333 28951.000000 2.142857 1110.714286 8.970476e+07 2012.428571 5472.047619 7.509524 2.250476 113904.761905
7 574.250000 131.083333 3284.833333 1755.833333 14489.500000 1.852365e+08 470265.500000 22592.000000 0.583333 1052.166667 6.869167e+07 2013.166667 4824.416667 7.758333 2.118333 140166.666667
8 624.400000 151.800000 8815.400000 9490.000000 18600.000000 2.726506e+08 837976.800000 46430.400000 0.500000 1700.400000 1.424000e+08 2012.600000 10433.200000 8.280000 2.350000 164600.000000
9 683.250000 156.000000 4187.500000 985.250000 25000.000000 2.075798e+08 571334.000000 40640.250000 0.250000 1746.750000 1.587500e+08 2014.500000 12750.000000 7.900000 2.350000 194250.000000
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 712.000000 169.000000 22000.000000 6000.000000 11000.000000 1.879914e+08 928227.000000 31488.000000 1.000000 2725.000000 1.650000e+08 2014.000000 11000.000000 8.600000 2.350000 349000.000000
  • From the mean() output above, we can summarize the following. Note that these are group averages, not correlations.
    • The higher the movie facebook likes bin, the higher the average number of critic reviews.
    • Among bins 1 through 9, bin 8 has the highest average director facebook likes (bin 15, which looks like a single film, is higher still).
    • Bin 8 also has the highest average actor 3 facebook likes.
    • Interestingly, average gross also peaks in bin 8 rather than in the highest bins.
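The NaN rows for bins 10 through 14 mean that no film fell into those ranges. A related detail is easy to miss: `pd.cut` uses right-closed intervals by default, so a film with exactly 0 likes falls outside the first bin. The sketch below shows both effects on hypothetical values (`likes` and `edges` are made up for illustration).

```python
import pandas as pd

# Hypothetical like counts, including a zero and one very large value
likes = pd.Series([0, 5000, 10000, 10001, 349000])
edges = [0, 10000, 25000, 350000]

# Default: intervals are (0, 10000], (10000, 25000], (25000, 350000]
default_bins = pd.cut(likes, bins=edges, labels=[1, 2, 3])

# include_lowest=True closes the first interval on the left as well
with_zero = pd.cut(likes, bins=edges, labels=[1, 2, 3], include_lowest=True)

print(default_bins.tolist())               # the 0 is left unbinned (NaN)
print(with_zero.tolist())                  # the 0 lands in bin 1
print(with_zero.value_counts(sort=False))  # films per bin; a 0 count is an empty bin
```

Whether zero-like films belong in the first bin is a modelling choice; the point is only that the default silently drops them from the grouped averages.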
In [50]:
# Let's look at this from a heatmap perspective (numeric columns only)
plt.figure(figsize=(12,12))
sns.heatmap(df_likes.select_dtypes('number').corr(), vmax=.8, square=True, annot=True, fmt=".1f", cmap='Greens')
plt.title("Correlation on movie_facebook_likes");
  • We see a few things from this heat map.
    • Movie facebook likes is highly correlated with the number of critic reviews, similar to what we saw above.
    • What really stands out is that actor 1 facebook likes is highly correlated with cast total facebook likes.
    • The number of voted users is also strongly related to the number of user reviews.

Plotting

In [51]:
# Credit: https://colab.research.google.com/drive/1nSqg5YEr1Hd8uK0utqasqcav06Hgm2ev#scrollTo=u_9cRmtVZ6ba

# Let's look at the imdb score alongside num_voted_users.
# Let's show the breakdown by title_year, and display the budget for year on hover.
# The size of the bubble will be the movie's aspect_ratio.
px.scatter(df, x="imdb_score", y="num_voted_users", color="title_year", hover_name='budget', size='aspect_ratio')
  • This graph appears to show that a majority of the movies fall in the 2007 - 2015 title_year range, and the aspect ratios are about the same across the board. The higher the imdb_score, the higher the number of voters.
In [52]:
# Credit: https://colab.research.google.com/drive/1nSqg5YEr1Hd8uK0utqasqcav06Hgm2ev#scrollTo=u_9cRmtVZ6ba

# Let's look at the imdb score along side duration. 
# Let's show the breakdown by title_year, and display the budget for each year on hover. 
# The size of the bubble will be how much the movie grossed
px.scatter(df, x="imdb_score", y="duration", color="title_year", hover_name='budget', size='gross')

This graph appears to show that the imdb_scores are generally in the range of 6-8 with a fairly low duration. The chart also breaks the data down by title year and lets you see the budget and gross on hover.

In [53]:
# Credit: https://colab.research.google.com/drive/1nSqg5YEr1Hd8uK0utqasqcav06Hgm2ev#scrollTo=u_9cRmtVZ6ba

px.scatter(df, x="imdb_score", y="duration", color="title_year", text='director_name')
  • This plotly chart clearly shows that a majority of the movies fall in the 6 - 8 imdb_score range, and that their durations are fairly low. It is rare for a really long movie to receive a high imdb_score. See Ron Maxwell, Taylor Hackford, etc.
In [54]:
px.scatter(df, x="imdb_score", y="title_year", trendline='ols')
  • The ols trend line slopes downward. With imdb_score on the x-axis and title_year on the y-axis, this means that higher-scoring movies tend to be slightly older, not newer. The imdb_scores overall still lean toward the higher end. It makes me wonder how much social media plays into the scores of newer movies.

Final Project

Analysis:

  • Key findings from regression
  • Key findings from classification
  • Key findings from clustering
  • Your “best” classification model in terms of metrics (e.g., confusion matrix, AUC score)

Story telling: Overall suggestions and implications

  • What variables are considered important to predict imdb_score and movie success?
  • What recommendations do you have for movie producers / investors / viewers?
  • What additional variables would you need to improve the model prediction?
  • Any other suggestions

Regression

In [55]:
#assigning columns to X and Y variables
X = df['movie_facebook_likes']
y = df['imdb_score']
In [56]:
# We create the model and call it model1.
model1 = lm.LinearRegression()
# We fit the model on the full dataset (there is no train/test split here).
X_2d = X.to_numpy()[:, np.newaxis]    ## X needs to be 2d for LinearRegression, so convert to an array and add [:,np.newaxis]
model1.fit(X_2d, y)
# Now, we predict points with our trained model.
model1_y = model1.predict(X_2d)
In [57]:
# The coefficients
print('Coefficients: ', model1.coef_)
# y-intercept
print("y-intercept ", model1.intercept_)
Coefficients:  [1.39570082e-05]
y-intercept  6.310128139629282

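Linear Regression Model: y = 0.000014x + 6.31

A one unit increase in movie_facebook_likes increases the predicted imdb_score by about 0.000014, so roughly 10,000 additional likes raise it by about 0.14. The intercept, 6.31, is the predicted imdb_score for a movie with zero likes.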
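To make the size of the coefficient concrete, the printed values can be plugged into the fitted line directly. This is a small arithmetic sketch using the coefficient and intercept shown above; `predict_score` is a helper name introduced here for illustration.

```python
# Fitted values printed above for imdb_score ~ movie_facebook_likes
slope = 1.39570082e-05
intercept = 6.310128139629282

def predict_score(likes):
    """Predicted imdb_score for a given number of movie Facebook likes."""
    return intercept + slope * likes

for likes in (0, 10_000, 100_000):
    print(f"{likes:>7,} likes -> predicted score {predict_score(likes):.2f}")

# Change in the predicted score for 10,000 extra likes
delta = predict_score(10_000) - predict_score(0)
print(f"10,000 extra likes add about {delta:.2f} points")
```

The slope is tiny because the predictor is measured in single likes; the effect only becomes visible at the scale of tens of thousands of likes.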

In [58]:
# try to evaluate the performance of our model's prediction using visualization

plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)   #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
In [59]:
# Let's build 2nd model
X = df['duration']
y = df['imdb_score'] 
model2 = lm.LinearRegression()
X_2d = X.to_numpy()[:, np.newaxis]    # LinearRegression needs a 2d array
model2.fit(X_2d, y)
model2_y = model2.predict(X_2d)
print('Coefficients: ', model2.coef_)
print("y-intercept ", model2.intercept_)
Coefficients:  [0.01694597]
y-intercept  4.585492572794051
In [60]:
# try to evaluate the performance of our model's prediction using visualization

plt.subplots()
plt.scatter(y, model2_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)   #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
In [61]:
# Choose a different variable as X and develop another (3rd) linear regression model (model3).
X = df['director_facebook_likes']
y = df['imdb_score'] 
model3 = lm.LinearRegression()
X_2d = X.to_numpy()[:, np.newaxis]    # LinearRegression needs a 2d array
model3.fit(X_2d, y)
model3_y = model3.predict(X_2d)       # predict with model3, not model2
print('Coefficients: ', model3.coef_)
print("y-intercept ", model3.intercept_)
Coefficients:  [6.67633662e-05]
y-intercept  6.386226538447825
In [62]:
# try to evaluate the performance of our model's prediction using visualization

plt.subplots()
plt.scatter(y, model3_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)   #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
In [63]:
print("%.10f" % model3.coef_[0])
0.0000667634
In [64]:
print("mean square error: ", mean_squared_error(y, model1_y))
print(explained_variance_score(y, model1_y))
mean square error:  1.0035945761451142
0.08218784688948111
In [65]:
print("mean square error: ", mean_squared_error(y, model2_y))
print(explained_variance_score(y, model2_y))
mean square error:  0.9574632427568129
0.1243760964370293
In [66]:
# evaluate your model (3rd)
print("mean square error: ", mean_squared_error(y, model3_y))
print(explained_variance_score(y, model3_y))
mean square error:  1.0526957760179436
0.037283580717883846

scikit-learn

In [67]:
# lm = sklearn.linear_model (scikit-learn)

#assigning columns to X and Y variables
# Let's use the highest positive correlating value as seen above in the corr() section.
X = df['director_facebook_likes']
y = df['imdb_score']
In [68]:
# First, we create the model and call it model1.
model1 = lm.LinearRegression()
# Second, we fit the model on the full dataset (there is no train/test split here).
X_2d = X.to_numpy()[:, np.newaxis]    ## X needs to be 2d for LinearRegression, so convert to an array and add [:,np.newaxis]
model1.fit(X_2d, y)
# Now, we predict points with our trained model.
model1_y = model1.predict(X_2d)
In [69]:
# The coefficients
print('Coefficients: ', model1.coef_)
# y-intercept
print("y-intercept ", model1.intercept_)
Coefficients:  [6.67633662e-05]
y-intercept  6.386226538447825

Linear Regression Model: y = 0.0000668x + 6.39

A one unit increase in director_facebook_likes increases the predicted imdb_score by about 0.0000668, so 10,000 additional director likes correspond to roughly 0.67 points.

In [70]:
# Let's evaluate the performance of our model's prediction using visualization

plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)   
#dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title("Actual vs predicted")
plt.show()
In [71]:
# Let's see what model1 returns
model1
Out[71]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [72]:
# print the coefficients and the y-intercept
print('Coefficients: ', model1.coef_)
print("y-intercept ", model1.intercept_)
Coefficients:  [6.67633662e-05]
y-intercept  6.386226538447825
In [73]:
# Did we get a low MSE?
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
mean square error:  1.0526957760179436
variance or r-squared:  0.037283580717883846

We then compute the mean squared error and the explained variance (r-squared). The MSE of about 1.05 looks small, but the model explains only about 3.7% of the variance in imdb_score. This isn't a good regression model.
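An MSE only looks "low" relative to the spread of the target. The reported MSE (1.05) and explained variance (0.037) together imply a variance of about 1.09 for imdb_score, so predicting the mean for every movie would do almost as well. The sketch below shows the relationship on synthetic data (not the movie data), using `np.polyfit` as a stand-in for the linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a weak linear signal buried in noise
x = rng.normal(size=500)
y = 6.4 + 0.2 * x + rng.normal(scale=1.0, size=500)

# Least-squares line (np.polyfit returns the highest degree first)
slope, intercept = np.polyfit(x, y, deg=1)
pred = intercept + slope * x

mse = np.mean((y - pred) ** 2)
baseline_mse = np.mean((y - y.mean()) ** 2)   # always predict the mean
r_squared = 1 - mse / baseline_mse

print(f"model MSE    : {mse:.3f}")
print(f"baseline MSE : {baseline_mse:.3f}")
print(f"R-squared    : {r_squared:.3f}")
```

Comparing the model MSE against the baseline MSE is usually more informative than judging the MSE on its own.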

Lasso Regression

In [74]:
# First we need to define y and X. 
y1 = df['imdb_score']

## Instead of choosing all of the other columns one by one, we are going to drop "y" and the text columns to get X.
X1 = df.drop(['imdb_score','director_name','actor_2_name','actor_1_name','actor_3_name','movie_title','plot_keywords','movie_imdb_link','country','content_rating'], axis =1)
In [75]:
# Verify which columns are in X. Return one row.
X1.head(1)
Out[75]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews ... Mystery Romance Sci-Fi Short Sport Thriller War Western return_on_investment imdbscores_bins
0 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 ... 0 0 1 0 0 0 0 0 320.888543 4

1 rows × 40 columns

In [76]:
# Time to fit the Lasso model
model2 = lm.Lasso(alpha=0.1)             #higher alpha (penalty parameter), fewer predictors
model2.fit(X1, y1)
model2_y = model2.predict(X1)
In [77]:
# Let's see what model2 returns
model2
Out[77]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
In [78]:
# print the coefficients and the y-intercept
print('Coefficients: ', model2.coef_)
print("y-intercept ", model2.intercept_)
Coefficients:  [ 1.25189704e-03  5.53352570e-03  1.10077914e-06  5.20715706e-05
  4.94500446e-05 -9.87469278e-10  1.66314689e-06 -4.86944688e-05
 -0.00000000e+00 -2.59860993e-04 -2.32192746e-11 -1.04412511e-02
  5.02274110e-05 -0.00000000e+00 -5.57653786e-07 -0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00 -0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00 -0.00000000e+00
  0.00000000e+00  0.00000000e+00 -0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00 -0.00000000e+00
  0.00000000e+00 -0.00000000e+00 -1.28564465e-08  1.01725093e+00]
y-intercept  22.79707000920895
In [79]:
# Did we get a low MSE? (compare against y1, the target this model was fit on)
print("mean square error: ", mean_squared_error(y1, model2_y))
print("variance or r-squared: ", explained_variance_score(y1, model2_y))
mean square error:  0.2751596573155702
variance or r-squared:  0.7483596628232081

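The Lasso model looks far better than the single-variable models: the MSE drops to about 0.28 and the explained variance rises to about 0.75, which is closer to 1 than 0. The result is inflated, though. X1 still contains imdbscores_bins, which is derived from imdb_score itself, and it receives by far the largest coefficient (about 1.02). This score should not be read as genuine predictive power until that column is dropped.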
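X1 above still includes imdbscores_bins, which is built from imdb_score, so the model is partly predicting the target from itself. The sketch below illustrates that effect on synthetic data only. It uses ordinary least squares through `np.linalg.lstsq` rather than Lasso to stay dependency-free, and the names `duration`, `score`, `score_bin`, and `r_squared` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins (not the movie data)
duration = rng.normal(110, 20, size=n)
score = np.clip(6.4 + 0.01 * (duration - 110) + rng.normal(0, 1, size=n), 1, 10)
score_bin = np.digitize(score, [2, 4, 6, 8]) + 1   # bins 1..5 built from the target

def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1 - resid.var() / target.var()

r2_honest = r_squared([duration], score)
r2_leaky = r_squared([duration, score_bin], score)

print(f"without the target-derived bin: {r2_honest:.2f}")
print(f"with the target-derived bin   : {r2_leaky:.2f}")
```

Dropping imdbscores_bins from X1 and refitting would show how much of the 0.75 comes from real predictors.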

Classification

Let's start by defining what each imdb_score represents categorically.

  • Create the column by “binning” the imdb_score into 4 categories (or buckets): “less than 4, 4-6, 6-8 and 8-10, which represents bad, OK, good and excellent respectively”
In [80]:
dfclass['imdb_category'] = pd.cut(dfclass['imdb_score'], bins=[0, 4, 6, 8, 10], labels=[4,6,8,10])
dfclass.head()
Out[80]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes imdb_category
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000 8
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0 8
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000 8
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000 10
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Action|Adventure|Sci-Fi ... English USA PG-13 263700000.0 2012.0 632.0 6.6 2.35 24000 8

5 rows × 29 columns

In [81]:
# Comparing values to Movie Facebook Likes, to see if they have any bearing on IMDB Score.
# group by both Movie Facebook Likes and IMDB Category and count the rows in each group
dfclass.groupby(['movie_facebook_likes', 'imdb_category']).size().sort_values(ascending=False).plot(figsize=(10,5))
plt.xlabel('Movie Facebook Likes, IMDB Category')
plt.ylabel('Count')
plt.title("IMDB Category grouped by Movie Facebook Likes");
In [82]:
# Comparing values to Duration, to see if they have any bearing on IMDB Score.
# group by both Duration and IMDB Category and count the rows in each group
dfclass.groupby(['duration', 'imdb_category']).size().sort_values(ascending=False).plot(figsize=(10,5))
plt.xlabel('Duration, IMDB Category')
plt.ylabel('Count')
plt.title("IMDB Category grouped by Duration");
In [83]:
# Let's see what columns are objects, based on text in the results. 
# We will need to remove these object columns in the next step.
dfclass.head().T
Out[83]:
0 1 2 3 5
color Color Color Color Color Color
director_name James Cameron Gore Verbinski Sam Mendes Christopher Nolan Andrew Stanton
num_critic_for_reviews 723 302 602 813 462
duration 178 169 148 164 132
director_facebook_likes 0 563 0 22000 475
actor_3_facebook_likes 855 1000 161 23000 530
actor_2_name Joel David Moore Orlando Bloom Rory Kinnear Christian Bale Samantha Morton
actor_1_facebook_likes 1000 40000 11000 27000 640
gross 7.60506e+08 3.09404e+08 2.00074e+08 4.48131e+08 7.30587e+07
genres Action|Adventure|Fantasy|Sci-Fi Action|Adventure|Fantasy Action|Adventure|Thriller Action|Thriller Action|Adventure|Sci-Fi
actor_1_name CCH Pounder Johnny Depp Christoph Waltz Tom Hardy Daryl Sabara
movie_title Avatar Pirates of the Caribbean: At World's End Spectre The Dark Knight Rises John Carter
num_voted_users 886204 471220 275868 1144337 212204
cast_total_facebook_likes 4834 48350 11700 106759 1873
actor_3_name Wes Studi Jack Davenport Stephanie Sigman Joseph Gordon-Levitt Polly Walker
facenumber_in_poster 0 0 1 0 1
plot_keywords avatar|future|marine|native|paraplegic goddess|marriage ceremony|marriage proposal|pi... bomb|espionage|sequel|spy|terrorist deception|imprisonment|lawlessness|police offi... alien|american civil war|male nipple|mars|prin...
movie_imdb_link http://www.imdb.com/title/tt0499549/?ref_=fn_t... http://www.imdb.com/title/tt0449088/?ref_=fn_t... http://www.imdb.com/title/tt2379713/?ref_=fn_t... http://www.imdb.com/title/tt1345836/?ref_=fn_t... http://www.imdb.com/title/tt0401729/?ref_=fn_t...
num_user_for_reviews 3054 1238 994 2701 738
language English English English English English
country USA USA UK USA USA
content_rating PG-13 PG-13 PG-13 PG-13 PG-13
budget 2.37e+08 3e+08 2.45e+08 2.5e+08 2.637e+08
title_year 2009 2007 2015 2012 2012
actor_2_facebook_likes 936 5000 393 23000 632
imdb_score 7.9 7.1 6.8 8.5 6.6
aspect_ratio 1.78 2.35 2.35 2.35 2.35
movie_facebook_likes 33000 0 85000 164000 24000
imdb_category 8 8 8 10 8
In [84]:
# For classification to work, we need only numeric columns.
# Drop all object columns, plus gross, budget and imdb_score (imdb_category is derived from imdb_score)
dfint = dfclass.drop(['gross','genres','budget','color','imdb_score','director_name','actor_2_name','actor_1_name','actor_3_name','movie_title','plot_keywords','movie_imdb_link','country','content_rating','language'], axis = 1)
dfint.head().T
Out[84]:
0 1 2 3 5
num_critic_for_reviews 723 302 602 813 462
duration 178 169 148 164 132
director_facebook_likes 0 563 0 22000 475
actor_3_facebook_likes 855 1000 161 23000 530
actor_1_facebook_likes 1000 40000 11000 27000 640
num_voted_users 886204 471220 275868 1144337 212204
cast_total_facebook_likes 4834 48350 11700 106759 1873
facenumber_in_poster 0 0 1 0 1
num_user_for_reviews 3054 1238 994 2701 738
title_year 2009 2007 2015 2012 2012
actor_2_facebook_likes 936 5000 393 23000 632
aspect_ratio 1.78 2.35 2.35 2.35 2.35
movie_facebook_likes 33000 0 85000 164000 24000
imdb_category 8 8 8 10 8
In [85]:
# Let's look at the break down by imdb_category
# 4 = bad
# 6 = ok
# 8 = good
# 10 = excellent

dfclass.groupby('imdb_category').size()
Out[85]:
imdb_category
4       95
6     1055
8     2467
10     158
dtype: int64
In [86]:
# Before we build our models and then declare them, let's set X and Y

y = dfint['imdb_category']
X = dfint.drop(['imdb_category'], axis = 1)  # put everything else into X

print(y.shape, X.shape)
(3775,) (3775, 13)

Decision Tree

In [87]:
# Split validation: train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize decisiontreeclassifier()
dt = DecisionTreeClassifier()

# Train the model
dt = dt.fit(X_train, y_train)

dt
Out[87]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [88]:
#Model evaluation

print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
# print("--------------------------------------------------------")
# print(metrics.roc_auc_score(y_test, dt.predict(X_test)))
0.6566637246248896
--------------------------------------------------------
[[  3  17   7   0]
 [  7 150 153   0]
 [ 15 148 558  26]
 [  0   1  15  33]]
--------------------------------------------------------
              precision    recall  f1-score   support

           4       0.12      0.11      0.12        27
           6       0.47      0.48      0.48       310
           8       0.76      0.75      0.75       747
          10       0.56      0.67      0.61        49

    accuracy                           0.66      1133
   macro avg       0.48      0.50      0.49      1133
weighted avg       0.66      0.66      0.66      1133

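The model is about 66% accurate. Note that class 8 alone makes up 747 of the 1,133 test films (about 66%), so this is roughly what always predicting "good" would score, and recall for the "bad" class (4) is only 0.11.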
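Accuracy is easier to judge next to a baseline. The sketch below recomputes the accuracy from the confusion matrix printed above and compares it with always predicting the most common class; it uses only NumPy and the numbers already shown, and the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix printed above (rows = actual class, columns = predicted class)
labels = [4, 6, 8, 10]
cm = np.array([
    [3, 17, 7, 0],
    [7, 150, 153, 0],
    [15, 148, 558, 26],
    [0, 1, 15, 33],
])

accuracy = np.trace(cm) / cm.sum()             # correct predictions / all predictions
support = cm.sum(axis=1)                       # actual films per class
majority_baseline = support.max() / cm.sum()   # always predict the largest class
recall = np.diag(cm) / support

print(f"accuracy          : {accuracy:.3f}")
print(f"majority baseline : {majority_baseline:.3f}")
for label, r in zip(labels, recall):
    print(f"recall for class {label:>2}: {r:.2f}")
```

Because the classes are imbalanced, the macro-averaged scores in the classification report are a fairer summary than accuracy alone.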
In [89]:
# Visualize decision tree
from graphviz import Source
from sklearn import tree
Source( tree.export_graphviz(dt, out_file=None, feature_names=X.columns))
Out[89]:
[Figure: Graphviz rendering of the fitted decision tree. Root node: num_voted_users <= 67499.5, gini = 0.494, samples = 2642, value = [68, 745, 1720, 109]. The first-level splits are duration <= 102.5 and num_voted_users <= 532035.0.]
76->78 80 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 79->80 81 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 79->81 83 num_voted_users <= 45814.5 gini = 0.55 samples = 595 value = [47, 333, 215, 0] 82->83 396 num_user_for_reviews <= 202.5 gini = 0.439 samples = 44 value = [2, 11, 31, 0] 82->396 84 movie_facebook_likes <= 184.5 gini = 0.541 samples = 508 value = [42, 297, 169, 0] 83->84 347 director_facebook_likes <= 23.5 gini = 0.546 samples = 87 value = [5, 36, 46, 0] 83->347 85 director_facebook_likes <= 197.0 gini = 0.565 samples = 233 value = [19, 120, 94, 0] 84->85 212 num_voted_users <= 7906.5 gini = 0.504 samples = 275 value = [23, 177, 75, 0] 84->212 86 num_voted_users <= 41409.5 gini = 0.563 samples = 216 value = [19, 116, 81, 0] 85->86 205 cast_total_facebook_likes <= 6112.0 gini = 0.36 samples = 17 value = [0, 4, 13, 0] 85->205 87 num_voted_users <= 28743.5 gini = 0.572 samples = 200 value = [18, 102, 80, 0] 86->87 200 facenumber_in_poster <= 5.5 gini = 0.227 samples = 16 value = [1, 14, 1, 0] 86->200 88 actor_3_facebook_likes <= 558.5 gini = 0.567 samples = 148 value = [17, 83, 48, 0] 87->88 177 duration <= 91.5 gini = 0.487 samples = 52 value = [1, 19, 32, 0] 87->177 89 num_user_for_reviews <= 213.5 gini = 0.583 samples = 98 value = [10, 48, 40, 0] 88->89 146 actor_2_facebook_likes <= 750.5 gini = 0.465 samples = 50 value = [7, 35, 8, 0] 88->146 90 actor_2_facebook_likes <= 631.0 gini = 0.591 samples = 92 value = [10, 42, 40, 0] 89->90 145 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 89->145 91 director_facebook_likes <= 16.0 gini = 0.599 samples = 71 value = [10, 36, 25, 0] 90->91 136 director_facebook_likes <= 69.5 gini = 0.408 samples = 21 value = [0, 6, 15, 0] 90->136 92 aspect_ratio <= 1.61 gini = 0.541 samples = 37 value = [6, 23, 8, 0] 91->92 115 num_user_for_reviews <= 29.5 gini = 0.59 samples = 34 value = [4, 13, 17, 0] 91->115 93 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 92->93 94 num_user_for_reviews <= 89.5 gini = 0.503 samples = 35 value = 
[4, 23, 8, 0] 92->94 95 actor_1_facebook_likes <= 898.5 gini = 0.57 samples = 25 value = [3, 14, 8, 0] 94->95 112 num_user_for_reviews <= 174.0 gini = 0.18 samples = 10 value = [1, 9, 0, 0] 94->112 96 aspect_ratio <= 1.975 gini = 0.593 samples = 18 value = [2, 8, 8, 0] 95->96 109 actor_1_facebook_likes <= 23500.0 gini = 0.245 samples = 7 value = [1, 6, 0, 0] 95->109 97 num_critic_for_reviews <= 92.0 gini = 0.512 samples = 11 value = [1, 7, 3, 0] 96->97 104 num_voted_users <= 620.0 gini = 0.449 samples = 7 value = [1, 1, 5, 0] 96->104 98 title_year <= 1993.0 gini = 0.37 samples = 9 value = [1, 7, 1, 0] 97->98 103 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 97->103 99 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 98->99 100 actor_3_facebook_likes <= 534.5 gini = 0.219 samples = 8 value = [0, 7, 1, 0] 98->100 101 gini = 0.0 samples = 7 value = [0, 7, 0, 0] 100->101 102 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 100->102 105 num_critic_for_reviews <= 14.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 104->105 108 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 104->108 106 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 105->106 107 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 105->107 110 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 109->110 111 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 109->111 113 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 112->113 114 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 112->114 116 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 115->116 117 actor_1_facebook_likes <= 10500.0 gini = 0.571 samples = 30 value = [4, 9, 17, 0] 115->117 118 num_critic_for_reviews <= 54.0 gini = 0.532 samples = 27 value = [4, 6, 17, 0] 117->118 135 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 117->135 119 num_user_for_reviews <= 57.0 gini = 0.642 samples = 9 value = [4, 2, 3, 0] 118->119 126 num_voted_users <= 14640.0 gini = 0.346 samples = 18 value = [0, 4, 14, 0] 118->126 120 num_user_for_reviews <= 40.5 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 119->120 123 
actor_3_facebook_likes <= 148.5 gini = 0.32 samples = 5 value = [4, 1, 0, 0] 119->123 121 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 120->121 122 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 120->122 124 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 123->124 125 gini = 0.0 samples = 4 value = [4, 0, 0, 0] 123->125 127 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 126->127 128 num_voted_users <= 19499.0 gini = 0.48 samples = 10 value = [0, 4, 6, 0] 126->128 129 facenumber_in_poster <= 1.5 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 128->129 132 title_year <= 2013.5 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 128->132 130 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 129->130 131 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 129->131 133 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 132->133 134 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 132->134 137 duration <= 94.5 gini = 0.332 samples = 19 value = [0, 4, 15, 0] 136->137 144 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 136->144 138 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 137->138 139 actor_2_facebook_likes <= 910.0 gini = 0.494 samples = 9 value = [0, 4, 5, 0] 137->139 140 director_facebook_likes <= 35.0 gini = 0.444 samples = 6 value = [0, 4, 2, 0] 139->140 143 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 139->143 141 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 140->141 142 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 140->142 147 cast_total_facebook_likes <= 3973.0 gini = 0.5 samples = 8 value = [4, 4, 0, 0] 146->147 152 actor_2_facebook_likes <= 6000.0 gini = 0.414 samples = 42 value = [3, 31, 8, 0] 146->152 148 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 147->148 149 num_critic_for_reviews <= 104.0 gini = 0.32 samples = 5 value = [4, 1, 0, 0] 147->149 150 gini = 0.0 samples = 4 value = [4, 0, 0, 0] 149->150 151 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 149->151 153 cast_total_facebook_likes <= 3227.0 gini = 0.361 samples = 37 value = [3, 29, 5, 0] 152->153 174 actor_3_facebook_likes <= 655.0 gini = 0.48 samples 
= 5 value = [0, 2, 3, 0] 152->174 154 num_voted_users <= 19335.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 153->154 157 num_critic_for_reviews <= 191.5 gini = 0.306 samples = 34 value = [3, 28, 3, 0] 153->157 155 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 154->155 156 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 154->156 158 director_facebook_likes <= 7.5 gini = 0.268 samples = 33 value = [2, 28, 3, 0] 157->158 173 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 157->173 159 director_facebook_likes <= 5.0 gini = 0.494 samples = 9 value = [1, 6, 2, 0] 158->159 166 num_critic_for_reviews <= 99.5 gini = 0.156 samples = 24 value = [1, 22, 1, 0] 158->166 160 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 159->160 161 actor_3_facebook_likes <= 709.0 gini = 0.625 samples = 4 value = [1, 1, 2, 0] 159->161 162 cast_total_facebook_likes <= 11760.0 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 161->162 165 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 161->165 163 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 162->163 164 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 162->164 167 gini = 0.0 samples = 15 value = [0, 15, 0, 0] 166->167 168 num_critic_for_reviews <= 104.5 gini = 0.37 samples = 9 value = [1, 7, 1, 0] 166->168 169 aspect_ratio <= 2.1 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 168->169 172 gini = 0.0 samples = 7 value = [0, 7, 0, 0] 168->172 170 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 169->170 171 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 169->171 175 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 174->175 176 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 174->176 178 director_facebook_likes <= 2.5 gini = 0.444 samples = 18 value = [0, 12, 6, 0] 177->178 189 num_user_for_reviews <= 74.5 gini = 0.372 samples = 34 value = [1, 7, 26, 0] 177->189 179 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 178->179 180 aspect_ratio <= 1.8 gini = 0.32 samples = 15 value = [0, 12, 3, 0] 178->180 181 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 180->181 182 num_critic_for_reviews <= 197.0 gini 
= 0.245 samples = 14 value = [0, 12, 2, 0] 180->182 183 duration <= 89.5 gini = 0.142 samples = 13 value = [0, 12, 1, 0] 182->183 188 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 182->188 184 gini = 0.0 samples = 11 value = [0, 11, 0, 0] 183->184 185 title_year <= 2008.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 183->185 186 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 185->186 187 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 185->187 190 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 189->190 191 title_year <= 1990.5 gini = 0.314 samples = 32 value = [1, 5, 26, 0] 189->191 192 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 191->192 193 num_critic_for_reviews <= 33.0 gini = 0.271 samples = 31 value = [0, 5, 26, 0] 191->193 194 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 193->194 195 num_user_for_reviews <= 135.5 gini = 0.231 samples = 30 value = [0, 4, 26, 0] 193->195 196 gini = 0.0 samples = 15 value = [0, 0, 15, 0] 195->196 197 num_user_for_reviews <= 184.0 gini = 0.391 samples = 15 value = [0, 4, 11, 0] 195->197 198 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 197->198 199 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 197->199 201 num_user_for_reviews <= 88.0 gini = 0.124 samples = 15 value = [0, 14, 1, 0] 200->201 204 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 200->204 202 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 201->202 203 gini = 0.0 samples = 14 value = [0, 14, 0, 0] 201->203 206 num_critic_for_reviews <= 13.0 gini = 0.142 samples = 13 value = [0, 1, 12, 0] 205->206 209 cast_total_facebook_likes <= 27983.5 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 205->209 207 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 206->207 208 gini = 0.0 samples = 12 value = [0, 0, 12, 0] 206->208 210 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 209->210 211 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 209->211 213 title_year <= 1994.5 gini = 0.574 samples = 83 value = [8, 43, 32, 0] 212->213 252 movie_facebook_likes <= 609.0 gini = 0.457 samples = 192 value = [15, 134, 43, 0] 212->252 
214 gini = 0.0 samples = 14 value = [0, 14, 0, 0] 213->214 215 movie_facebook_likes <= 263.0 gini = 0.595 samples = 69 value = [8, 29, 32, 0] 213->215 216 duration <= 88.5 gini = 0.439 samples = 14 value = [3, 10, 1, 0] 215->216 223 title_year <= 2008.5 gini = 0.555 samples = 55 value = [5, 19, 31, 0] 215->223 217 cast_total_facebook_likes <= 3552.0 gini = 0.612 samples = 7 value = [3, 3, 1, 0] 216->217 222 gini = 0.0 samples = 7 value = [0, 7, 0, 0] 216->222 218 duration <= 85.0 gini = 0.375 samples = 4 value = [3, 0, 1, 0] 217->218 221 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 217->221 219 gini = 0.0 samples = 3 value = [3, 0, 0, 0] 218->219 220 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 218->220 224 num_user_for_reviews <= 111.0 gini = 0.502 samples = 43 value = [4, 11, 28, 0] 223->224 247 num_voted_users <= 1688.0 gini = 0.486 samples = 12 value = [1, 8, 3, 0] 223->247 225 num_critic_for_reviews <= 19.5 gini = 0.429 samples = 39 value = [2, 9, 28, 0] 224->225 244 num_user_for_reviews <= 138.5 gini = 0.5 samples = 4 value = [2, 2, 0, 0] 224->244 226 movie_facebook_likes <= 683.5 gini = 0.625 samples = 4 value = [2, 1, 1, 0] 225->226 231 facenumber_in_poster <= 0.5 gini = 0.353 samples = 35 value = [0, 8, 27, 0] 225->231 227 aspect_ratio <= 1.59 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 226->227 230 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 226->230 228 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 227->228 229 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 227->229 232 duration <= 93.0 gini = 0.496 samples = 11 value = [0, 5, 6, 0] 231->232 237 num_user_for_reviews <= 21.5 gini = 0.219 samples = 24 value = [0, 3, 21, 0] 231->237 233 num_user_for_reviews <= 33.5 gini = 0.408 samples = 7 value = [0, 5, 2, 0] 232->233 236 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 232->236 234 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 233->234 235 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 233->235 238 num_user_for_reviews <= 14.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 
237->238 241 actor_3_facebook_likes <= 918.0 gini = 0.091 samples = 21 value = [0, 1, 20, 0] 237->241 239 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 238->239 240 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 238->240 242 gini = 0.0 samples = 20 value = [0, 0, 20, 0] 241->242 243 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 241->243 245 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 244->245 246 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 244->246 248 num_critic_for_reviews <= 7.0 gini = 0.375 samples = 4 value = [1, 0, 3, 0] 247->248 251 gini = 0.0 samples = 8 value = [0, 8, 0, 0] 247->251 249 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 248->249 250 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 248->250 253 num_critic_for_reviews <= 29.5 gini = 0.34 samples = 75 value = [8, 60, 7, 0] 252->253 290 director_facebook_likes <= 188.5 gini = 0.502 samples = 117 value = [7, 74, 36, 0] 252->290 254 num_voted_users <= 15444.5 gini = 0.48 samples = 5 value = [3, 2, 0, 0] 253->254 257 actor_1_facebook_likes <= 410.0 gini = 0.298 samples = 70 value = [5, 58, 7, 0] 253->257 255 gini = 0.0 samples = 3 value = [3, 0, 0, 0] 254->255 256 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 254->256 258 num_critic_for_reviews <= 79.5 gini = 0.625 samples = 4 value = [1, 1, 2, 0] 257->258 263 director_facebook_likes <= 249.5 gini = 0.245 samples = 66 value = [4, 57, 5, 0] 257->263 259 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 258->259 260 facenumber_in_poster <= 2.0 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 258->260 261 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 260->261 262 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 260->262 264 title_year <= 1993.5 gini = 0.201 samples = 64 value = [3, 57, 4, 0] 263->264 287 num_user_for_reviews <= 173.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 263->287 265 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 264->265 266 num_user_for_reviews <= 90.5 gini = 0.177 samples = 63 value = [3, 57, 3, 0] 264->266 267 num_user_for_reviews <= 84.5 gini = 0.328 samples = 26 
value = [2, 21, 3, 0] 266->267 282 actor_3_facebook_likes <= 659.5 gini = 0.053 samples = 37 value = [1, 36, 0, 0] 266->282 268 actor_1_facebook_likes <= 566.5 gini = 0.226 samples = 24 value = [1, 21, 2, 0] 267->268 279 num_voted_users <= 14038.0 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 267->279 269 aspect_ratio <= 2.1 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 268->269 272 num_critic_for_reviews <= 56.5 gini = 0.165 samples = 22 value = [0, 20, 2, 0] 268->272 270 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 269->270 271 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 269->271 273 gini = 0.0 samples = 13 value = [0, 13, 0, 0] 272->273 274 cast_total_facebook_likes <= 3179.0 gini = 0.346 samples = 9 value = [0, 7, 2, 0] 272->274 275 facenumber_in_poster <= 1.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 274->275 278 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 274->278 276 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 275->276 277 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 275->277 280 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 279->280 281 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 279->281 283 gini = 0.0 samples = 32 value = [0, 32, 0, 0] 282->283 284 actor_2_facebook_likes <= 737.5 gini = 0.32 samples = 5 value = [1, 4, 0, 0] 282->284 285 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 284->285 286 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 284->286 288 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 287->288 289 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 287->289 291 num_user_for_reviews <= 422.5 gini = 0.527 samples = 106 value = [7, 63, 36, 0] 290->291 346 gini = 0.0 samples = 11 value = [0, 11, 0, 0] 290->346 292 num_critic_for_reviews <= 80.0 gini = 0.511 samples = 104 value = [5, 63, 36, 0] 291->292 345 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 291->345 293 duration <= 100.5 gini = 0.397 samples = 36 value = [2, 27, 7, 0] 292->293 308 num_voted_users <= 15820.0 gini = 0.536 samples = 68 value = [3, 36, 29, 0] 292->308 294 num_user_for_reviews <= 223.0 gini = 
0.344 samples = 34 value = [2, 27, 5, 0] 293->294 307 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 293->307 295 cast_total_facebook_likes <= 43929.0 gini = 0.307 samples = 33 value = [1, 27, 5, 0] 294->295 306 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 294->306 296 cast_total_facebook_likes <= 3452.5 gini = 0.271 samples = 32 value = [1, 27, 4, 0] 295->296 305 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 295->305 297 actor_1_facebook_likes <= 897.0 gini = 0.5 samples = 14 value = [1, 9, 4, 0] 296->297 304 gini = 0.0 samples = 18 value = [0, 18, 0, 0] 296->304 298 gini = 0.0 samples = 7 value = [0, 7, 0, 0] 297->298 299 num_critic_for_reviews <= 68.5 gini = 0.571 samples = 7 value = [1, 2, 4, 0] 297->299 300 actor_2_facebook_likes <= 439.0 gini = 0.32 samples = 5 value = [1, 0, 4, 0] 299->300 303 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 299->303 301 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 300->301 302 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 300->302 309 num_user_for_reviews <= 63.0 gini = 0.43 samples = 11 value = [1, 2, 8, 0] 308->309 316 actor_3_facebook_likes <= 208.0 gini = 0.507 samples = 57 value = [2, 34, 21, 0] 308->316 310 cast_total_facebook_likes <= 3025.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 309->310 313 duration <= 99.0 gini = 0.198 samples = 9 value = [0, 1, 8, 0] 309->313 311 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 310->311 312 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 310->312 314 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 313->314 315 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 313->315 317 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 316->317 318 num_voted_users <= 17238.0 gini = 0.536 samples = 48 value = [2, 25, 21, 0] 316->318 319 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 318->319 320 num_user_for_reviews <= 125.5 gini = 0.543 samples = 43 value = [2, 20, 21, 0] 318->320 321 actor_2_facebook_likes <= 7500.0 gini = 0.219 samples = 8 value = [0, 1, 7, 0] 320->321 324 actor_1_facebook_likes <= 2500.0 gini = 0.542 samples = 
35 value = [2, 19, 14, 0] 320->324 322 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 321->322 323 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 321->323 325 movie_facebook_likes <= 647.5 gini = 0.506 samples = 25 value = [2, 16, 7, 0] 324->325 340 actor_2_facebook_likes <= 2000.0 gini = 0.42 samples = 10 value = [0, 3, 7, 0] 324->340 326 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 325->326 327 aspect_ratio <= 1.61 gini = 0.461 samples = 23 value = [2, 16, 5, 0] 325->327 328 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 327->328 329 movie_facebook_likes <= 27500.0 gini = 0.417 samples = 22 value = [1, 16, 5, 0] 327->329 330 actor_2_facebook_likes <= 994.5 gini = 0.381 samples = 21 value = [1, 16, 4, 0] 329->330 339 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 329->339 331 num_voted_users <= 26183.5 gini = 0.335 samples = 20 value = [1, 16, 3, 0] 330->331 338 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 330->338 332 gini = 0.0 samples = 10 value = [0, 10, 0, 0] 331->332 333 num_voted_users <= 30486.5 gini = 0.54 samples = 10 value = [1, 6, 3, 0] 331->333 334 num_user_for_reviews <= 249.0 gini = 0.375 samples = 4 value = [1, 0, 3, 0] 333->334 337 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 333->337 335 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 334->335 336 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 334->336 341 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 340->341 342 title_year <= 2002.5 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 340->342 343 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 342->343 344 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 342->344 348 num_user_for_reviews <= 202.5 gini = 0.414 samples = 41 value = [0, 12, 29, 0] 347->348 365 num_user_for_reviews <= 482.5 gini = 0.579 samples = 46 value = [5, 24, 17, 0] 347->365 349 num_voted_users <= 65873.5 gini = 0.18 samples = 20 value = [0, 2, 18, 0] 348->349 356 movie_facebook_likes <= 3956.0 gini = 0.499 samples = 21 value = [0, 10, 11, 0] 348->356 350 num_user_for_reviews <= 102.5 gini = 0.1 samples = 19 value 
= [0, 1, 18, 0] 349->350 355 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 349->355 351 num_user_for_reviews <= 97.0 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 350->351 354 gini = 0.0 samples = 15 value = [0, 0, 15, 0] 350->354 352 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 351->352 353 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 351->353 357 cast_total_facebook_likes <= 7492.0 gini = 0.484 samples = 17 value = [0, 10, 7, 0] 356->357 364 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 356->364 358 duration <= 99.5 gini = 0.298 samples = 11 value = [0, 9, 2, 0] 357->358 361 director_facebook_likes <= 8.5 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 357->361 359 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 358->359 360 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 358->360 362 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 361->362 363 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 361->363 366 actor_3_facebook_likes <= 153.5 gini = 0.549 samples = 44 value = [3, 24, 17, 0] 365->366 395 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 365->395 367 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 366->367 368 title_year <= 2014.5 gini = 0.535 samples = 41 value = [3, 24, 14, 0] 366->368 369 num_voted_users <= 50062.5 gini = 0.511 samples = 38 value = [3, 24, 11, 0] 368->369 394 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 368->394 370 duration <= 87.0 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 369->370 373 director_facebook_likes <= 87.0 gini = 0.471 samples = 32 value = [3, 22, 7, 0] 369->373 371 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 370->371 372 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 370->372 374 num_critic_for_reviews <= 128.0 gini = 0.34 samples = 23 value = [0, 18, 5, 0] 373->374 387 num_critic_for_reviews <= 163.5 gini = 0.642 samples = 9 value = [3, 4, 2, 0] 373->387 375 director_facebook_likes <= 30.0 gini = 0.5 samples = 8 value = [0, 4, 4, 0] 374->375 382 num_critic_for_reviews <= 190.5 gini = 0.124 samples = 15 value = [0, 14, 1, 0] 374->382 376 gini = 0.0 samples 
= 2 value = [0, 0, 2, 0] 375->376 377 title_year <= 2001.5 gini = 0.444 samples = 6 value = [0, 4, 2, 0] 375->377 378 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 377->378 379 num_voted_users <= 55479.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 377->379 380 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 379->380 381 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 379->381 383 gini = 0.0 samples = 12 value = [0, 12, 0, 0] 382->383 384 num_critic_for_reviews <= 210.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 382->384 385 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 384->385 386 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 384->386 388 cast_total_facebook_likes <= 3731.5 gini = 0.49 samples = 7 value = [3, 4, 0, 0] 387->388 393 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 387->393 389 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 388->389 390 num_user_for_reviews <= 291.5 gini = 0.375 samples = 4 value = [3, 1, 0, 0] 388->390 391 gini = 0.0 samples = 3 value = [3, 0, 0, 0] 390->391 392 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 390->392 397 num_voted_users <= 11290.5 gini = 0.307 samples = 33 value = [1, 5, 27, 0] 396->397 410 cast_total_facebook_likes <= 2833.5 gini = 0.562 samples = 11 value = [1, 6, 4, 0] 396->410 398 duration <= 92.5 gini = 0.58 samples = 10 value = [1, 4, 5, 0] 397->398 405 facenumber_in_poster <= 2.5 gini = 0.083 samples = 23 value = [0, 1, 22, 0] 397->405 399 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 398->399 400 num_voted_users <= 3203.0 gini = 0.5 samples = 6 value = [1, 4, 1, 0] 398->400 401 actor_2_facebook_likes <= 428.0 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 400->401 404 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 400->404 402 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 401->402 403 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 401->403 406 gini = 0.0 samples = 20 value = [0, 0, 20, 0] 405->406 407 director_facebook_likes <= 585.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 405->407 408 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 
407->408 409 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 407->409 411 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 410->411 412 duration <= 93.0 gini = 0.5 samples = 6 value = [1, 1, 4, 0] 410->412 413 num_user_for_reviews <= 418.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 412->413 416 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 412->416 414 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 413->414 415 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 413->415 418 num_voted_users <= 61965.0 gini = 0.294 samples = 155 value = [1, 24, 128, 2] 417->418 475 duration <= 118.5 gini = 0.47 samples = 571 value = [9, 191, 369, 2] 417->475 419 num_voted_users <= 520.5 gini = 0.238 samples = 145 value = [0, 20, 125, 0] 418->419 468 title_year <= 2006.0 gini = 0.7 samples = 10 value = [1, 4, 3, 2] 418->468 420 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 419->420 421 actor_3_facebook_likes <= 106.0 gini = 0.22 samples = 143 value = [0, 18, 125, 0] 419->421 422 actor_2_facebook_likes <= 1.0 gini = 0.134 samples = 97 value = [0, 7, 90, 0] 421->422 445 actor_2_facebook_likes <= 124.0 gini = 0.364 samples = 46 value = [0, 11, 35, 0] 421->445 423 movie_facebook_likes <= 620.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 422->423 426 num_critic_for_reviews <= 61.5 gini = 0.118 samples = 95 value = [0, 6, 89, 0] 422->426 424 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 423->424 425 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 423->425 427 movie_facebook_likes <= 8000.0 gini = 0.278 samples = 24 value = [0, 4, 20, 0] 426->427 438 cast_total_facebook_likes <= 1068.5 gini = 0.055 samples = 71 value = [0, 2, 69, 0] 426->438 428 actor_2_facebook_likes <= 74.0 gini = 0.227 samples = 23 value = [0, 3, 20, 0] 427->428 437 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 427->437 429 actor_2_facebook_likes <= 59.5 gini = 0.375 samples = 12 value = [0, 3, 9, 0] 428->429 436 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 428->436 430 actor_1_facebook_likes <= 298.5 gini = 0.18 samples = 10 value = [0, 1, 9, 0] 
429->430 435 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 429->435 431 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 430->431 432 facenumber_in_poster <= 1.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 430->432 433 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 432->433 434 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 432->434 439 gini = 0.0 samples = 58 value = [0, 0, 58, 0] 438->439 440 actor_2_facebook_likes <= 82.5 gini = 0.26 samples = 13 value = [0, 2, 11, 0] 438->440 441 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 440->441 442 cast_total_facebook_likes <= 1135.0 gini = 0.153 samples = 12 value = [0, 1, 11, 0] 440->442 443 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 442->443 444 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 442->444 446 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 445->446 447 num_voted_users <= 8847.5 gini = 0.325 samples = 44 value = [0, 9, 35, 0] 445->447 448 num_voted_users <= 4210.5 gini = 0.496 samples = 11 value = [0, 5, 6, 0] 447->448 453 num_user_for_reviews <= 340.0 gini = 0.213 samples = 33 value = [0, 4, 29, 0] 447->453 449 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 448->449 450 title_year <= 1983.5 gini = 0.278 samples = 6 value = [0, 5, 1, 0] 448->450 451 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 450->451 452 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 450->452 454 actor_3_facebook_likes <= 247.5 gini = 0.17 samples = 32 value = [0, 3, 29, 0] 453->454 467 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 453->467 455 facenumber_in_poster <= 3.5 gini = 0.124 samples = 30 value = [0, 2, 28, 0] 454->455 464 director_facebook_likes <= 204.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 454->464 456 actor_3_facebook_likes <= 121.0 gini = 0.071 samples = 27 value = [0, 1, 26, 0] 455->456 461 num_user_for_reviews <= 148.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 455->461 457 aspect_ratio <= 2.1 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 456->457 460 gini = 0.0 samples = 23 value = [0, 0, 23, 0] 456->460 458 gini = 0.0 samples = 1 value = 
[0, 1, 0, 0] 457->458 459 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 457->459 462 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 461->462 463 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 461->463 465 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 464->465 466 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 464->466 469 cast_total_facebook_likes <= 1011.0 gini = 0.611 samples = 6 value = [1, 0, 3, 2] 468->469 474 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 468->474 470 num_user_for_reviews <= 783.0 gini = 0.444 samples = 3 value = [1, 0, 0, 2] 469->470 473 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 469->473 471 gini = 0.0 samples = 2 value = [0, 0, 0, 2] 470->471 472 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 470->472 476 movie_facebook_likes <= 35.0 gini = 0.503 samples = 364 value = [7, 146, 211, 0] 475->476 639 num_voted_users <= 3276.0 gini = 0.37 samples = 207 value = [2, 45, 158, 2] 475->639 477 num_critic_for_reviews <= 15.5 gini = 0.455 samples = 183 value = [3, 58, 122, 0] 476->477 554 director_facebook_likes <= 619.0 gini = 0.521 samples = 181 value = [4, 88, 89, 0] 476->554 478 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 477->478 479 num_critic_for_reviews <= 162.0 gini = 0.447 samples = 180 value = [3, 55, 122, 0] 477->479 480 num_critic_for_reviews <= 150.0 gini = 0.473 samples = 148 value = [3, 50, 95, 0] 479->480 547 actor_3_facebook_likes <= 95.0 gini = 0.264 samples = 32 value = [0, 5, 27, 0] 479->547 481 director_facebook_likes <= 23.5 gini = 0.456 samples = 142 value = [3, 44, 95, 0] 480->481 546 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 480->546 482 actor_2_facebook_likes <= 700.5 gini = 0.338 samples = 48 value = [1, 9, 38, 0] 481->482 497 director_facebook_likes <= 35.5 gini = 0.493 samples = 94 value = [2, 35, 57, 0] 481->497 483 actor_2_facebook_likes <= 615.0 gini = 0.453 samples = 26 value = [0, 9, 17, 0] 482->483 492 num_critic_for_reviews <= 36.5 gini = 0.087 samples = 22 value = [1, 0, 21, 0] 482->492 484 actor_2_facebook_likes <= 382.5 gini = 
[Figure: decision tree visualization (continued). Each node shows its split rule (feature <= threshold), gini impurity, sample count, and per-class sample counts (value) for the four target classes. Split features appearing in this portion of the tree: duration, title_year, aspect_ratio, facenumber_in_poster, num_voted_users, num_user_for_reviews, num_critic_for_reviews, movie_facebook_likes, director_facebook_likes, cast_total_facebook_likes, actor_1_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes.]
1028->1029 1032 gini = 0.0 samples = 17 value = [0, 0, 17, 0] 1028->1032 1030 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1029->1030 1031 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1029->1031 1036 gini = 0.0 samples = 5 value = [0, 0, 0, 5] 1035->1036 1037 movie_facebook_likes <= 82500.0 gini = 0.301 samples = 38 value = [0, 0, 31, 7] 1035->1037 1038 num_user_for_reviews <= 523.5 gini = 0.175 samples = 31 value = [0, 0, 28, 3] 1037->1038 1049 duration <= 119.0 gini = 0.49 samples = 7 value = [0, 0, 3, 4] 1037->1049 1039 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1038->1039 1040 cast_total_facebook_likes <= 1611.5 gini = 0.124 samples = 30 value = [0, 0, 28, 2] 1038->1040 1041 actor_3_facebook_likes <= 140.0 gini = 0.5 samples = 2 value = [0, 0, 1, 1] 1040->1041 1044 num_critic_for_reviews <= 278.5 gini = 0.069 samples = 28 value = [0, 0, 27, 1] 1040->1044 1042 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1041->1042 1043 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1041->1043 1045 actor_3_facebook_likes <= 802.0 gini = 0.5 samples = 2 value = [0, 0, 1, 1] 1044->1045 1048 gini = 0.0 samples = 26 value = [0, 0, 26, 0] 1044->1048 1046 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1045->1046 1047 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1045->1047 1050 gini = 0.0 samples = 3 value = [0, 0, 0, 3] 1049->1050 1051 duration <= 132.0 gini = 0.375 samples = 4 value = [0, 0, 3, 1] 1049->1051 1052 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1051->1052 1053 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1051->1053 1055 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1054->1055 1056 num_critic_for_reviews <= 317.5 gini = 0.198 samples = 54 value = [0, 0, 6, 48] 1054->1056 1057 gini = 0.0 samples = 31 value = [0, 0, 0, 31] 1056->1057 1058 director_facebook_likes <= 525.5 gini = 0.386 samples = 23 value = [0, 0, 6, 17] 1056->1058 1059 aspect_ratio <= 2.1 gini = 0.496 samples = 11 value = [0, 0, 6, 5] 1058->1059 1064 gini = 0.0 samples = 12 value = [0, 0, 0, 12] 1058->1064 1060 gini = 0.0 
samples = 4 value = [0, 0, 0, 4] 1059->1060 1061 title_year <= 1989.0 gini = 0.245 samples = 7 value = [0, 0, 6, 1] 1059->1061 1062 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1061->1062 1063 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 1061->1063
In [90]:
# visualizing the new decision tree (2nd option)
from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases
import pydotplus

dot_data = StringIO() 
tree.export_graphviz(dt, out_file=dot_data, feature_names=X.columns,
                     filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("data/dt.pdf")
Out[90]:
True

KNN

In [91]:
# evaluate the model by splitting into train and test sets & develop knn model (name it as knn)


# split validation - hold out 30% of the data as a test set before fitting the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# random_state=42 fixes the random seed so the same split is produced on every run.

# Initialize KNeighborsClassifier() ... name the model "knn"
knn = KNeighborsClassifier()  # default n_neighbors = 5 ... see below

# Train the KNN model
# knn is an empty model until it is trained with fit
knn = knn.fit(X_train, y_train)

knn
Out[91]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
In [92]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html

print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
# print("--------------------------------------------------------")
# print(metrics.roc_auc_score(y_test, knn.predict(X_test)))
0.6557811120917917
--------------------------------------------------------
[[  0  12  15   0]
 [  6 122 182   0]
 [  7 135 591  14]
 [  0   1  18  30]]
--------------------------------------------------------
              precision    recall  f1-score   support

           4       0.00      0.00      0.00        27
           6       0.45      0.39      0.42       310
           8       0.73      0.79      0.76       747
          10       0.68      0.61      0.65        49

    accuracy                           0.66      1133
   macro avg       0.47      0.45      0.46      1133
weighted avg       0.64      0.66      0.64      1133
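The report above can be re-derived from the confusion matrix alone, which makes it easier to see where the 66% comes from. The sketch below hard-codes the matrix printed above (rows are the true classes 4, 6, 8, 10; columns are the predictions) and recomputes accuracy, per-class precision and recall, and, for comparison, the accuracy of always predicting the most common class. The variable names are illustrative and are not part of the notebook.

```python
import numpy as np

# Confusion matrix copied from the KNN output above.
# Rows = true class, columns = predicted class, order: 4, 6, 8, 10.
labels = [4, 6, 8, 10]
cm = np.array([
    [0,  12,  15,  0],
    [6, 122, 182,  0],
    [7, 135, 591, 14],
    [0,   1,  18, 30],
])

support = cm.sum(axis=1)      # true examples per class
predicted = cm.sum(axis=0)    # predictions made per class
correct = np.diag(cm)         # correctly classified examples per class

accuracy = correct.sum() / cm.sum()
recall = correct / support
# Guard against division by zero for classes that are never predicted.
precision = np.divide(correct, predicted,
                      out=np.zeros(len(labels)), where=predicted > 0)

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = support.max() / cm.sum()

for lab, p, r in zip(labels, precision, recall):
    print(f"class {lab:>2}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

On these numbers, always predicting class 8 would score about 0.659, slightly above the KNN accuracy of 0.656, so the macro-averaged F1 (0.46 in the report) is probably a more informative number than accuracy when comparing models on this imbalanced target.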

In [93]:
# evaluate the knn model using 10-fold cross-validation

scores = cross_val_score(KNeighborsClassifier(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
[0.66490765 0.70184697 0.69129288 0.67282322 0.69920844 0.65517241
 0.60212202 0.59308511 0.6        0.58666667]
0.6467125358430692
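The mean alone hides how much the folds disagree. As a small sketch, the fold accuracies printed above can be summarized with their spread as well (the values are copied from the output; nothing is refit here).

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.66490765, 0.70184697, 0.69129288, 0.67282322, 0.69920844,
                   0.65517241, 0.60212202, 0.59308511, 0.6, 0.58666667])

mean = scores.mean()
std = scores.std(ddof=1)   # sample standard deviation across the 10 folds

print(f"accuracy: {mean:.3f} +/- {std:.3f} "
      f"(min {scores.min():.3f}, max {scores.max():.3f})")
```

The last four folds score noticeably lower than the first five, which may reflect the row order of the data: with an integer `cv`, `cross_val_score` uses stratified folds without shuffling for classifiers. Passing `cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42)` would be one way to check whether the ordering matters.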
In [94]:
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}

#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5, iid=False)

#fit model to training data
knn_gs.fit(X_train, y_train)
Out[94]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid=False, n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [95]:
#save best model
knn_best = knn_gs.best_estimator_

#check best n_neighbors value
print(knn_gs.best_score_)
print(knn_gs.best_params_)
print(knn_gs.best_estimator_)
0.6684326907786103
{'n_neighbors': 24}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=24, p=2,
                     weights='uniform')
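The selected value, n_neighbors = 24, is the largest value in the search grid, so the best setting may lie beyond it. KNN is also distance-based, so features with large numeric ranges (for example num_voted_users) can dominate the distance unless the data is scaled. The following is a minimal sketch of one way to address both points, assuming synthetic stand-in data from `make_classification` rather than the notebook's X and y; the step names `scale` and `knn` and the upper bound of 61 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook's X and y are not used here).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# Scale inside the pipeline so each CV fold is scaled using only its training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Wider grid of odd values; the upper bound of 61 is an arbitrary illustrative choice.
param_grid = {"knn__n_neighbors": np.arange(1, 62, 2)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

If the best value again lands on the edge of the grid, that is a hint to widen the range further.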

Best model?

Based on the test accuracy above, KNN is the best model, but only by about one percentage point.

  • Decision Tree : 65%
  • KNN : 66%

Clustering

In [96]:
# Before we can cluster we need a clean, numeric-only set of data.

# Drop the text (object) columns, along with imdb_category, gross and budget
dfcluster = dfclass.drop(['imdb_category','gross','genres','budget','color','director_name','actor_2_name','actor_1_name','actor_3_name','movie_title','plot_keywords','movie_imdb_link','country','content_rating','language'], axis = 1)
dfcluster.head().T
Out[96]:
0 1 2 3 5
num_critic_for_reviews 723.00 302.00 602.00 813.00 462.00
duration 178.00 169.00 148.00 164.00 132.00
director_facebook_likes 0.00 563.00 0.00 22000.00 475.00
actor_3_facebook_likes 855.00 1000.00 161.00 23000.00 530.00
actor_1_facebook_likes 1000.00 40000.00 11000.00 27000.00 640.00
num_voted_users 886204.00 471220.00 275868.00 1144337.00 212204.00
cast_total_facebook_likes 4834.00 48350.00 11700.00 106759.00 1873.00
facenumber_in_poster 0.00 0.00 1.00 0.00 1.00
num_user_for_reviews 3054.00 1238.00 994.00 2701.00 738.00
title_year 2009.00 2007.00 2015.00 2012.00 2012.00
actor_2_facebook_likes 936.00 5000.00 393.00 23000.00 632.00
imdb_score 7.90 7.10 6.80 8.50 6.60
aspect_ratio 1.78 2.35 2.35 2.35 2.35
movie_facebook_likes 33000.00 0.00 85000.00 164000.00 24000.00
In [97]:
# check the variance of each column
dfcluster.var()
Out[97]:
num_critic_for_reviews       1.529460e+04
duration                     5.130435e+02
director_facebook_likes      9.343422e+06
actor_3_facebook_likes       3.448648e+06
actor_1_facebook_likes       2.406899e+08
num_voted_users              2.286276e+10
cast_total_facebook_likes    3.627370e+08
facenumber_in_poster         4.166973e+00
num_user_for_reviews         1.680854e+05
title_year                   9.780129e+01
actor_2_facebook_likes       2.032344e+07
imdb_score                   1.110753e+00
aspect_ratio                 1.235851e-01
movie_facebook_likes         4.603982e+08
dtype: float64
In [98]:
# normalize the data!
df_norm = (dfcluster - dfcluster.mean()) / (dfcluster.max() - dfcluster.min())
df_norm.head()
Out[98]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 0.687300 0.231800 -0.034748 0.004200 -0.010455 0.462709 -0.010018 -0.031991 0.538094 0.067240 -0.007705 0.186679 -0.022317 0.068059
1 0.168188 0.201083 -0.010269 0.010504 0.050483 0.217115 0.056244 -0.031991 0.179059 0.044768 0.021959 0.082783 0.016145 -0.026497
2 0.538101 0.129411 -0.034748 -0.025974 0.005170 0.101502 0.000437 -0.008736 0.130819 0.134656 -0.011668 0.043822 0.016145 0.217056
3 0.798274 0.184018 0.921774 0.967026 0.030170 0.615476 0.145183 -0.031991 0.468304 0.100948 0.153346 0.264601 0.016145 0.443417
5 0.365475 0.074803 -0.014095 -0.009931 -0.011017 0.063825 -0.014526 -0.008736 0.080206 0.100948 -0.009924 0.017848 0.016145 0.042271
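The normalization used above is mean normalization: each column is centered on its mean and divided by its range, so every column ends up with mean 0 and a spread of exactly 1 between its minimum and maximum. A tiny sketch with made-up values (not the movie data) shows the effect:

```python
import pandas as pd

# Tiny illustrative frame (made-up values, not the movie data).
demo = pd.DataFrame({
    "duration": [90.0, 120.0, 150.0],
    "num_voted_users": [1_000.0, 50_000.0, 900_000.0],
})

# Same formula as the cell above: center on the mean, divide by the range.
demo_norm = (demo - demo.mean()) / (demo.max() - demo.min())

print(demo_norm.round(3))
print(demo_norm.mean().round(12))                      # approximately 0 for every column
print((demo_norm.max() - demo_norm.min()).round(12))   # 1 for every column
```

Because both columns now live on the same scale, a distance-based method such as k-means no longer lets the column with the largest raw numbers dominate.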
In [99]:
# variance test after normalization
df_norm.var()
Out[99]:
num_critic_for_reviews       0.023254
duration                     0.005976
director_facebook_likes      0.017662
actor_3_facebook_likes       0.006519
actor_1_facebook_likes       0.000588
num_voted_users              0.008008
cast_total_facebook_likes    0.000841
facenumber_in_poster         0.002254
num_user_for_reviews         0.006570
title_year                   0.012347
actor_2_facebook_likes       0.001083
imdb_score                   0.018734
aspect_ratio                 0.000563
movie_facebook_likes         0.003780
dtype: float64

Clustering analysis (k = 2): Include "random_state=0"

In [100]:
# two clusters: clustering analysis with k = 2
from sklearn.cluster import KMeans  # in case it was not imported earlier

k_means = KMeans(init='k-means++', n_clusters=2, random_state=0)
k_means.fit(df_norm)
Out[100]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
In [101]:
#clustering results
k_means.labels_
Out[101]:
array([1, 1, 1, ..., 0, 0, 0], dtype=int32)
In [102]:
# find out cluster centers

k_means.cluster_centers_
Out[102]:
array([[-0.06187907, -0.01295932, -0.02092564, -0.01033934, -0.0022431 ,
        -0.03087563, -0.00357524,  0.00093598, -0.02525122, -0.01348886,
        -0.00399459, -0.03117228, -0.00153654, -0.01862146],
       [ 0.20817121,  0.04359726,  0.07039725,  0.03478321,  0.00754614,
         0.10387061,  0.01202768, -0.00314879,  0.08494918,  0.04537872,
         0.01343844,  0.10486861,  0.00516918,  0.0626456 ]])
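The centers above are expressed in normalized units, which are hard to read. Since the normalization is a simple shift and rescale per column, a center can be mapped back to the original units with center * (max - min) + mean. The sketch below demonstrates the idea on made-up data; `raw`, `centers_norm`, and `centers_raw` are hypothetical names, and the "centers" are just group means used as stand-ins for fitted k-means centers.

```python
import numpy as np
import pandas as pd

# Made-up raw data standing in for dfcluster (two columns only).
raw = pd.DataFrame({"duration": [80.0, 100.0, 120.0, 160.0],
                    "imdb_score": [5.0, 6.0, 7.0, 8.0]})

col_mean = raw.mean()
col_range = raw.max() - raw.min()
norm = (raw - col_mean) / col_range

# Stand-ins for two cluster centers in normalized space:
# the means of the first two rows and of the last two rows.
centers_norm = np.vstack([norm.iloc[:2].mean().to_numpy(),
                          norm.iloc[2:].mean().to_numpy()])

# Undo the normalization column by column.
centers_raw = pd.DataFrame(
    centers_norm * col_range.to_numpy() + col_mean.to_numpy(),
    columns=raw.columns,
)
print(centers_raw)
```

Applied to the real model, the same arithmetic would use `k_means.cluster_centers_` together with the mean and range of `dfcluster`.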
In [103]:
# convert cluster labels to dataframe

df1 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df1.head()
Out[103]:
cluster
0 1
1 1
2 1
3 1
4 1
In [104]:
# Look at the cluster breakdown
df1.groupby('cluster').size()
Out[104]:
cluster
0    2910
1     865
dtype: int64
  • Cluster 1 (0) = 2910
  • Cluster 2 (1) = 865
In [105]:
# join df_norm & df1

df2 = df_norm.join(df1)
df2.head()
Out[105]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
0 0.687300 0.231800 -0.034748 0.004200 -0.010455 0.462709 -0.010018 -0.031991 0.538094 0.067240 -0.007705 0.186679 -0.022317 0.068059 1.0
1 0.168188 0.201083 -0.010269 0.010504 0.050483 0.217115 0.056244 -0.031991 0.179059 0.044768 0.021959 0.082783 0.016145 -0.026497 1.0
2 0.538101 0.129411 -0.034748 -0.025974 0.005170 0.101502 0.000437 -0.008736 0.130819 0.134656 -0.011668 0.043822 0.016145 0.217056 1.0
3 0.798274 0.184018 0.921774 0.967026 0.030170 0.615476 0.145183 -0.031991 0.468304 0.100948 0.153346 0.264601 0.016145 0.443417 1.0
5 0.365475 0.074803 -0.014095 -0.009931 -0.011017 0.063825 -0.014526 -0.008736 0.080206 0.100948 -0.009924 0.017848 0.016145 0.042271 1.0
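One thing worth checking in the join above: `DataFrame.join` matches rows by index label, not by position. The feature frame keeps the original row labels (0, 1, 2, 3, 5, ...), while the label frame built from `k_means.labels_` gets a fresh 0..n-1 index, and the `cluster` column prints as a float (1.0), which is what pandas does when some rows have no match. The sketch below reproduces that situation on made-up data and shows one possible remedy, assuming the labels are in the same row order as the features; all names in it are hypothetical.

```python
import numpy as np
import pandas as pd

# Feature frame whose index has a gap (row 4 was dropped), like df_norm above.
features = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
                        index=[0, 1, 2, 3, 5, 6])
labels = np.array([1, 1, 0, 0, 1, 0])   # one label per row, in row order

# Pattern used above: the label frame gets a fresh 0..n-1 index.
by_position = pd.DataFrame(labels, columns=["cluster"])
misaligned = features.join(by_position)

# Alternative: give the labels the same index as the features.
by_label = pd.DataFrame(labels, columns=["cluster"], index=features.index)
aligned = features.join(by_label)

print(misaligned)   # row 5 picks up row 6's label, row 6 gets NaN, column is float
print(aligned)      # every row keeps its own label, column stays integer
```

If the same mismatch applies here, the per-cluster means computed from the joined frame would be based on partly shifted labels, so it may be worth rebuilding the label frame with `index=df_norm.index` and comparing the results.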
In [106]:
# What are the profiles for each cluster?
df2.groupby(['cluster']).mean() 
Out[106]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
cluster
0.0 -0.003868 0.002224 0.002485 0.000950 0.000530 -0.001534 0.000673 0.001244 -0.002870 0.002018 0.000420 -0.005401 0.000622 -0.002726
1.0 0.052102 0.018016 0.007270 0.011487 0.002081 0.028921 0.003996 0.000747 0.027279 0.017278 0.005476 -0.000992 0.003702 0.017669

This result suggests that "imdb_score" is NOT an important factor in separating the clusters, since its mean is almost the same in each cluster.

profiling

  1. cluster 0:
    • extremely low num_critic_for_reviews
    • low duration
    • low director_facebook_likes
    • low actor_3_facebook_likes
    • low actor_1_facebook_likes
    • extremely low num_voted_users
    • low cast_total_facebook_likes
    • high facenumber_in_poster
    • low num_user_for_reviews
    • low title_year
    • low actor_2_facebook_likes
    • low aspect_ratio
    • extremely low movie_facebook_likes
  2. cluster 1:
    • extremely high num_critic_for_reviews
    • high duration
    • high director_facebook_likes
    • high actor_3_facebook_likes
    • high actor_1_facebook_likes
    • extremely high num_voted_users
    • high cast_total_facebook_likes
    • low facenumber_in_poster
    • high num_user_for_reviews
    • high title_year
    • high actor_2_facebook_likes
    • high aspect_ratio
    • extremely high movie_facebook_likes

AgglomerativeClustering

In [107]:
# rebuild the normalized feature matrix from dfcluster (same scaling as df_norm); note that this overwrites X
X = (dfcluster - dfcluster.mean()) / (dfcluster.max() - dfcluster.min())
X.head()
Out[107]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 0.687300 0.231800 -0.034748 0.004200 -0.010455 0.462709 -0.010018 -0.031991 0.538094 0.067240 -0.007705 0.186679 -0.022317 0.068059
1 0.168188 0.201083 -0.010269 0.010504 0.050483 0.217115 0.056244 -0.031991 0.179059 0.044768 0.021959 0.082783 0.016145 -0.026497
2 0.538101 0.129411 -0.034748 -0.025974 0.005170 0.101502 0.000437 -0.008736 0.130819 0.134656 -0.011668 0.043822 0.016145 0.217056
3 0.798274 0.184018 0.921774 0.967026 0.030170 0.615476 0.145183 -0.031991 0.468304 0.100948 0.153346 0.264601 0.016145 0.443417
5 0.365475 0.074803 -0.014095 -0.009931 -0.011017 0.063825 -0.014526 -0.008736 0.080206 0.100948 -0.009924 0.017848 0.016145 0.042271
In [108]:
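from sklearn.cluster import AgglomerativeClustering  # in case it was not imported earlier

np.random.seed(1) # Ward linkage is deterministic; the seed only matters for any later random steps.

agg = AgglomerativeClustering(n_clusters=4, linkage='ward').fit(X)
agg.labels_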
Out[108]:
array([0, 0, 0, ..., 1, 3, 1])
In [109]:
from scipy.cluster.hierarchy import ward, dendrogram  # in case they were not imported earlier

plt.figure(figsize=(16,8))

linkage_matrix = ward(X)
dendrogram(linkage_matrix, orientation="left")
plt.tight_layout() # fixes margins
In [110]:
plt.figure(figsize=(16,8))

plt.title('Hierarchical Clustering Dendrogram')  # full tree; truncation is commented out below
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')

linkage_matrix = ward(X)
dendrogram(linkage_matrix, 
           #truncate_mode='lastp',  # show only the last p merged clusters
           #p=12,  # show only the last p merged clusters
           #show_leaf_counts=False,  # otherwise numbers in brackets are counts
           leaf_rotation=90.,
           leaf_font_size=12.,
           show_contracted=True,  # to get a distribution impression in truncated branches
           orientation="top")
plt.tight_layout() # fixes margins
In [111]:
plt.figure(figsize=(16,8))

plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')

linkage_matrix = ward(X)
dendrogram(linkage_matrix, 
           truncate_mode='lastp',  # show only the last p merged clusters
           p=4,  # show only the last p merged clusters
           #show_leaf_counts=False,  # otherwise numbers in brackets are counts
           leaf_rotation=90.,
           leaf_font_size=12.,
           show_contracted=True,  # to get a distribution impression in truncated branches
           orientation="top")
plt.tight_layout() # fixes margins
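The truncated dendrogram above shows the last four merged groups. The same linkage matrix can also be cut into flat cluster labels with SciPy's `fcluster`, which gives a way to read cluster membership straight off the tree. This is a minimal sketch on synthetic blobs (not the movie features); `points`, `flat_labels`, and `sizes` are illustrative names.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Four well-separated synthetic blobs (illustrative data, not the movie features).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

linkage_matrix = ward(points)

# Cut the tree so that at most 4 flat clusters remain.
flat_labels = fcluster(linkage_matrix, t=4, criterion="maxclust")

# fcluster labels start at 1, so index 0 of bincount is always empty.
sizes = np.bincount(flat_labels)[1:]
print(sizes)
```

The label numbering from `fcluster` will generally differ from `agg.labels_`, so the two should be compared by group membership rather than by label value.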
In [112]:
#To add cluster label into the dataset as a column
df1 = pd.DataFrame(agg.labels_, columns = ['cluster'])
df1.head()
Out[112]:
cluster
0 0
1 0
2 0
3 2
4 0
In [113]:
df2 = df.join(df1)
df2.head()
Out[113]:
director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name movie_title ... Romance Sci-Fi Short Sport Thriller War Western return_on_investment imdbscores_bins cluster
0 James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder Avatar ... 0 1 0 0 0 0 0 320.888543 4 0.0
1 Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp Pirates of the Caribbean: At World's End ... 0 0 0 0 0 0 0 103.134717 4 0.0
2 Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz Spectre ... 0 0 0 0 1 0 0 81.662929 4 0.0
3 Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy The Dark Knight Rises ... 0 0 0 0 1 0 0 179.252257 5 2.0
5 Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Daryl Sabara John Carter ... 0 1 0 0 0 0 0 27.705225 4 0.0

5 rows × 51 columns

In [114]:
# What are the cluster profiles for df2 (the original, non-normalized data)?
df2.groupby('cluster').mean()
Out[114]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews ... Musical Mystery Romance Sci-Fi Short Sport Thriller War Western return_on_investment
cluster
0.0 207.062500 114.614796 926.445153 964.125000 9220.557398 8.355683e+07 145976.428571 14169.973214 1.438520 451.746173 ... 0.021684 0.073980 0.218112 0.193878 0.000000 0.035714 0.283163 0.038265 0.017857 144.128025
1.0 159.300958 109.432969 838.695622 762.248290 7395.777702 4.679609e+07 96829.048564 10971.281122 1.368536 309.935705 ... 0.030780 0.115595 0.231190 0.111491 0.000684 0.043092 0.315321 0.044460 0.015732 185.265176
2.0 184.051852 117.333333 765.970370 1283.651852 8314.125926 7.231600e+07 146352.481481 13338.770370 1.703704 418.481481 ... 0.014815 0.103704 0.237037 0.155556 0.000000 0.022222 0.325926 0.081481 0.014815 132.941094
3.0 161.254321 110.156790 883.500000 778.349383 8981.100000 5.070294e+07 101081.879012 13065.872840 1.440741 310.109877 ... 0.013580 0.113580 0.245679 0.138272 0.000000 0.045679 0.309877 0.040741 0.014815 170.642317

4 rows × 40 columns

In [115]:
# Look at the cluster breakdown
df2.groupby('cluster').size()
Out[115]:
cluster
0.0     784
1.0    1462
2.0     135
3.0     810
dtype: int64
  • Cluster 1 (0) = 784
  • Cluster 2 (1) = 1462
  • Cluster 3 (2) = 135
  • Cluster 4 (3) = 810
In [116]:
sns.lmplot(x="cluster", y="movie_facebook_likes", data=df2, x_jitter=.15, y_jitter=.15)
Out[116]:
<seaborn.axisgrid.FacetGrid at 0x1c29525320>
In [117]:
sns.lmplot(x="cluster", y="duration", data=df2, x_jitter=.15, y_jitter=.15)
Out[117]:
<seaborn.axisgrid.FacetGrid at 0x1c2811d860>
In [118]:
sns.lmplot(x="cluster", y="num_voted_users", data=df2, x_jitter=.15, y_jitter=.15)
Out[118]:
<seaborn.axisgrid.FacetGrid at 0x1c28a20780>

Interpretation of Clustering Analysis

  • For the questions below, use the original data (not the normalized data) for better interpretability.
In [119]:
# join dfcluster & df1

df3 = dfcluster.join(df1)
df3.head()
Out[119]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
0 723.0 178.0 0.0 855.0 1000.0 886204 4834 0.0 3054.0 2009.0 936.0 7.9 1.78 33000 0.0
1 302.0 169.0 563.0 1000.0 40000.0 471220 48350 0.0 1238.0 2007.0 5000.0 7.1 2.35 0 0.0
2 602.0 148.0 0.0 161.0 11000.0 275868 11700 1.0 994.0 2015.0 393.0 6.8 2.35 85000 0.0
3 813.0 164.0 22000.0 23000.0 27000.0 1144337 106759 0.0 2701.0 2012.0 23000.0 8.5 2.35 164000 2.0
5 462.0 132.0 475.0 530.0 640.0 212204 1873 1.0 738.0 2012.0 632.0 6.6 2.35 24000 0.0

How many observations are there in cluster 1, 2, 3 and 4?

In [120]:
# Look at the cluster breakdown
df3.groupby('cluster').size()
Out[120]:
cluster
0.0     788
1.0    1467
2.0     137
3.0     822
dtype: int64
  • Cluster 1 (0) = 788
  • Cluster 2 (1) = 1467
  • Cluster 3 (2) = 137
  • Cluster 4 (3) = 822

The mean values of each cluster in terms of different variables

In [121]:
# Look at the profiles for each cluster.
df3.groupby('cluster').mean()
Out[121]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
cluster
0.0 207.796954 114.890863 958.189086 962.052030 9207.460660 147638.276650 14146.435279 1.438832 455.814721 2004.804569 2771.955584 6.457741 2.168985 15005.543147
1.0 159.950239 110.000000 845.344922 765.042263 7393.577369 99334.156783 10994.102249 1.370416 315.395365 2003.035446 1876.546694 6.443081 2.109277 7830.524199
2.0 186.248175 117.109489 757.452555 1269.948905 8202.846715 149035.197080 13170.452555 1.678832 426.503650 2004.240876 2528.277372 6.507299 2.164672 11473.240876
3.0 160.940389 111.218978 895.246959 772.669100 8915.480535 101205.673966 12965.709246 1.463504 310.811436 2003.002433 2216.335766 6.362895 2.126180 8659.684915

What is the profile of each cluster?

These results suggest that "title_year", "imdb_score", and "aspect_ratio" are NOT important factors in separating the clusters, since the mean of each is almost the same in every cluster.

profiling

  1. cluster 0:
    • extremely high num_critic_for_reviews
    • high duration
    • extremely high director_facebook_likes
    • high actor_3_facebook_likes
    • extremely high actor_1_facebook_likes
    • high num_voted_users
    • high cast_total_facebook_likes
    • moderately high facenumber_in_poster
    • high num_user_for_reviews
    • high actor_2_facebook_likes
    • high movie_facebook_likes
  2. cluster 1:
    • low num_critic_for_reviews
    • low duration
    • high director_facebook_likes
    • low actor_3_facebook_likes
    • low actor_1_facebook_likes
    • extremely low num_voted_users
    • low cast_total_facebook_likes
    • low facenumber_in_poster
    • low num_user_for_reviews
    • low actor_2_facebook_likes
    • extremely low movie_facebook_likes
  3. cluster 2:
    • high num_critic_for_reviews
    • high duration
    • extremely low director_facebook_likes
    • extremely high actor_3_facebook_likes
    • low actor_1_facebook_likes
    • high num_voted_users
    • high cast_total_facebook_likes
    • extremely high facenumber_in_poster
    • high num_user_for_reviews
    • high actor_2_facebook_likes
    • high movie_facebook_likes
  4. cluster 3:
    • low num_critic_for_reviews
    • low duration
    • high director_facebook_likes
    • low actor_3_facebook_likes
    • high actor_1_facebook_likes
    • low num_voted_users
    • low cast_total_facebook_likes
    • moderately high facenumber_in_poster
    • low num_user_for_reviews
    • low actor_2_facebook_likes
    • low movie_facebook_likes
In [122]:
df3.groupby('cluster')['director_facebook_likes'].mean()
Out[122]:
cluster
0.0    958.189086
1.0    845.344922
2.0    757.452555
3.0    895.246959
Name: director_facebook_likes, dtype: float64
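Labels such as "high" and "low" are easier to defend when each cluster mean is compared with a common reference. As a rough sketch, the four means printed above can be ranked and expressed as a percentage difference from their simple average (an unweighted average that ignores cluster sizes, chosen here only for illustration).

```python
import pandas as pd

# Cluster means copied from the output above.
director_likes = pd.Series({0: 958.189086, 1: 845.344922,
                            2: 757.452555, 3: 895.246959},
                           name="director_facebook_likes")

# Unweighted average of the four cluster means (ignores cluster sizes).
reference = director_likes.mean()
pct_diff = (director_likes / reference - 1) * 100

summary = pd.DataFrame({"mean": director_likes,
                        "pct_vs_avg": pct_diff.round(1)})
summary = summary.sort_values("mean", ascending=False)
print(summary)
```

On these four values the clusters range from roughly 12% below to 11% above the average, so the differences in this feature are moderate rather than extreme.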
In [123]:
sns.lmplot(x="cluster", y="director_facebook_likes", data=df3, x_jitter=.15, y_jitter=.15);

We can see that clusters 0, 1 and 3 have higher director_facebook_likes than cluster 2.

Storytelling

  • What did we learn in this project?

Project
The goal of this project was to identify which columns affected imdb_score. The following were positive factors.

  • Recent title year
  • Director facebook likes
  • Duration
  • Movie Facebook likes
  • Number of voters
  • Lower Duration

Data
Let's talk about the data we removed and cleaned up.

  • We removed color.
  • We split the genres into dummy columns.
  • We split imdb_score and movie_facebook_likes into bins.
  • We removed duplicate rows.
  • We removed language.
  • We created return on investment.
  • We removed rows with null values in certain columns.

We replaced missing values in several columns with the mean() of that column.

  • Num_critic_for_reviews
  • Duration
  • Actor_1_facebook_likes
  • Actor_2_facebook_likes
  • Actor_3_facebook_likes
  • Facenumber_in_poster
  • Aspect_ratio
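Mean replacement of this kind is usually done with `fillna`. The following is a minimal sketch on made-up rows, assuming the replaced values were missing (NaN) entries; the frame name `movies` and its values are hypothetical, and only the column names are borrowed from the list above.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps; column names borrowed from the list above.
movies = pd.DataFrame({
    "duration": [100.0, np.nan, 140.0],
    "aspect_ratio": [1.85, 2.35, np.nan],
})

cols = ["duration", "aspect_ratio"]
# Replace each missing value with the mean of its own column.
movies[cols] = movies[cols].fillna(movies[cols].mean())
print(movies)
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, which is worth keeping in mind when those columns later feed a distance-based model.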

Regression
Did the regression show results that made sense?

  • Higher facebook likes have a positive correlation to imdb_score.
  • This was in line with our analysis on the mid-term.

Classification

Our category breakdown was interesting.

  • imdb_category
  • 4 (bad) 95
  • 6 (OK) 1055
  • 8 (good) 2467
  • 10 (excellent) 158

Based on the data above, the best model is "KNN"

  • Decision Tree : 65%
  • KNN : 66%

Clustering

Normalized clustering
  • The k-means clustering on the normalized data (df2) shows that cluster 1 (label 0) is low on most features, while cluster 2 (label 1) is high.

Lessons
What are the top things we learned from this project?

  • It appears that the more recent movies have a positive correlation to imdb_score.
  • Any movie that attracts social media attention is apt to have a higher imdb_score.
  • Movies with high Facebook likes tend to score well and also tend to have more critic reviews.
  • A majority of our imdb_scores fell in the good category, with OK coming in second.
  • To better predict movie success, I would also like to see the following data: theater type, city of release, and lead actor gender.